{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d6d61b17-2254-4fba-9491-3a5f2b426711",
   "metadata": {},
   "source": [
    "# 赛事背景\n",
    "\n",
    "人民对于医疗健康的需求在不断增长，但社会现阶段医疗资源紧缺，往往排队一上午看病十分钟，时间和精神成本巨大。如何更好地优化医疗资源配置，找到合适的方向，进行分级诊疗，是当前社会的重要课题。\n",
    "\n",
    "大众自觉身体状态异常，有时不能准确判断自己是否患有疾病，需要寻求有专业知识的人进行判断，但是主诉者一般进行口语化表述，不容易进行精准高效的指引。\n",
    "\n",
    "# 赛事任务\n",
    "\n",
    "进行简单分诊需要一定的数据和经验知识进行支撑。本次比赛提供了部分好大夫在线的真实问诊数据，经过严格脱敏，提供给参赛者进行单分类任务。具体为：通过处理文字诉求，给出20个常见的就诊方向之一和61个疾病方向之一。\n",
    "\n",
    "# 评审规则\n",
    "\n",
    "1. 数据说明\n",
    "\n",
    "比赛共提供约22800条训练数据，约7600条测试数据。单条数据包含年龄段age、主诉diseaseName、标题title、希望获得的帮助hopeHelp和其他描述字段文本conditionDesc，以及就诊方向标签label_i∈int [0,19]，疾病方向标签label_j∈int [0,60] (训练数据中疾病方向标签存在缺失，以-1标记，测试数据中没有)。\n",
    "\n",
    "其中具体含义如下表：\n",
    "\n",
    "| label_i | 就诊方向         |\n",
    "| ------- | ---------------- |\n",
    "| 0       | 乳腺外科         |\n",
    "| 1       | 产前检查         |\n",
    "| 2       | 内科             |\n",
    "| 3       | 呼吸内科         |\n",
    "| 4       | 咽喉疾病         |\n",
    "| 5       | 妇产科           |\n",
    "| 6       | 小儿保健         |\n",
    "| 7       | 小儿呼吸系统疾病 |\n",
    "| 8       | 小儿消化疾病     |\n",
    "| 9       | 小儿耳鼻喉       |\n",
    "| 10      | 心内科           |\n",
    "| 11      | 消化内科         |\n",
    "| 12      | 甲状腺疾病       |\n",
    "| 13      | 皮肤科           |\n",
    "| 14      | 直肠肛管疾病     |\n",
    "| 15      | 眼科             |\n",
    "| 16      | 神经内科         |\n",
    "| 17      | 脊柱退行性变     |\n",
    "| 18      | 运动医学         |\n",
    "| 19      | 骨科             |\n",
    "\n",
    "| label_j | 疾病方向         | label_j | 疾病方向       | label_j | 疾病方向         |\n",
    "| ------- | ---------------- | ------- | -------------- | ------- | ---------------- |\n",
    "| 0       | 乳房囊肿         | 21      | 小儿支气管肺炎 | 42      | 皮肤瘙痒         |\n",
    "| 1       | 乳腺增生         | 22      | 小儿消化不良   | 43      | 皮肤科其他       |\n",
    "| 2       | 乳腺疾病         | 23      | 小儿消化疾病   | 44      | 直肠肛管疾病     |\n",
    "| 3       | 乳腺肿瘤         | 24      | 小儿耳鼻喉其他 | 45      | 眼部疾病         |\n",
    "| 4       | 产前检查         | 25      | 小儿肺炎       | 46      | 神经内科其他     |\n",
    "| 5       | 儿童保健         | 26      | 心内科其他     | 47      | 微量元素缺乏     |\n",
    "| 6       | 先兆流产         | 27      | 心脏病         | 48      | 羊水异常         |\n",
    "| 7       | 内科其他         | 28      | 扁桃体炎       | 49      | 肺部疾病         |\n",
    "| 8       | 剖腹产           | 29      | 早孕反应       | 50      | 胃病             |\n",
    "| 9       | 发育迟缓         | 30      | 月经失调       | 51      | 脊柱退行性变     |\n",
    "| 10      | 呼吸内科其他     | 31      | 桥本甲状腺炎   | 52      | 腰椎间盘突出     |\n",
    "| 11      | 咽喉疾病         | 32      | 消化不良       | 53      | 腹泻             |\n",
    "| 12      | 喉疾病           | 33      | 消化内科其他   | 54      | 腹痛             |\n",
    "| 13      | 围产保健         | 34      | 消化道出血     | 55      | 膝关节半月板损伤 |\n",
    "| 14      | 外阴疾病         | 35      | 甲减           | 56      | 膝关节损伤       |\n",
    "| 15      | 妇科病           | 36      | 甲状腺功能异常 | 57      | 膝关节韧带损伤   |\n",
    "| 16      | 宫腔镜           | 37      | 甲状腺疾病     | 58      | 运动医学         |\n",
    "| 17      | 小儿呼吸系统疾病 | 38      | 甲状腺瘤       | 59      | 韧带损伤         |\n",
    "| 18      | 小儿咳嗽         | 39      | 甲状腺结节     | 60      | 骨科其他         |\n",
    "| 19      | 小儿感冒         | 40      | 痔疮           |         |                  |\n",
    "| 20      | 小儿支气管炎     | 41      | 皮肤病         |         |                  |\n",
    "\n",
    "| id   | age  | diseaseName                     | conditionDesc                                                | title                        | hopeHelp                                                   | label_i | label_j |\n",
    "| ---- | ---- | ------------------------------- | ------------------------------------------------------------ | ---------------------------- | ---------------------------------------------------------- | ------- | ------- |\n",
    "| 1    | 30+  | 小红点是什么？                  | 四肢上部张图片中这样的小红疙瘩是怎么回事呢？特别痒。         | 小红点是什么？               | 请医生给我一些治疗上的建议                                 | 13      | 42      |\n",
    "| 2    | 30+  | 乳腺结节                        | 体检发现左侧乳腺有结节，13mm×8mm,自己没有任何症状            | 左侧乳腺结节                 | 请医生给我一些治疗上的建议,目前病情是否需要手术？          | 0       | 1       |\n",
    "| 3    | 20+  | 身体麻 身体坐左半面肢体有麻木感 | 年初患有带状疱疹 在左腿脚踝上方 之后疱疹好了 但是左脚开始麻 随后左手麻 然后左腿和左胳膊陆续开始麻 直到左边脸也略有麻木感 同时右脚也麻 | 左侧身体麻木                 | 什么原因导致的麻木以及如何治疗                             | 16      | 47      |\n",
    "| 4    | 10+  | 生长缓慢，想再增长10厘米        | 今年8月满15岁，身高165厘米，近半年生长缓慢，能否打生长激素再增高10厘米。 | 想增高10厘米                 | 是否可打生长激素                                           | 6       | 9       |\n",
    "| 5    | 30+  | 眼睛看东西突然变小，变远        | 从小记事时就有此病，发病时看任何东西都变小，距离也判断不太好了，没有就过医。 | 眼睛看东西突然变小，从小就有 | 希望医生诊断这是什么病？需要就医吗？有什么注意事项？谢谢。 | 15      | 46      |\n",
    "| 6    | 20+  | 甲状腺                          | 嗓子疼，说话感觉里面有回音，先是由感冒引起，后来大量运动后天气变化着凉，疼痛加剧 | 甲状腺肿大                   | 如何控制病情，是否还需做下一步检查                         | 12      | 36      |\n",
    "| 7    | 0+   | 晚上磨牙，缺钙                  | 女,5岁4个月。骨密度测试部缺钙，验血又显示缺钙，不知道到底怎么回事 | 孩子是否缺钙？               | 希望医生解答一下孩子是否缺钙                               | 6       | 48      |\n",
    "| 8    | 50+  | 手指麻木                        | 一月前出现十个手指指尖麻木，现症状加重。                     | 手指麻木                     | 会是什么病？需要做什么检查？                               | 16      | 47      |\n",
    "| 9    | 20+  | 便血，鲜红色的                  | 便血，鲜红色的，大便时还有点疼，就持续两天                   | 要不要做检查                 | 要不要做检查                                               | 11      | 34      |\n",
    "\n",
    "另提供14000条相关知识文本，每条包含主体/（属性）/客体，供选手随意选用。\n",
    "\n",
    "## 评估指标\n",
    "\n",
    "本模型依据提交的结果文件，采用两种标签类别求和评价，就诊方向label_i用F1-macro score进行评价，疾病方向label_j用F1-micro score进行评价, 最终结果取两者之和。\n",
    "\n",
    "## 评测及排行\n",
    "\n",
    "1、比赛提供下载数据，选手在本地进行算法调试，在比赛页面提交结果。\n",
    "\n",
    "2、每支团队每天最多提交3次。\n",
    "\n",
    "3、排行按照得分从高到低排序，排行榜将选择团队的历史最优成绩进行排名。\n",
    "\n",
    "4、要求最终预测两种标签使用的模型参数总量小于4e8(含模型融合)，前十名需提交代码和相关说明以备核验。\n",
    "\n",
    "# 作品提交要求\n",
    "\n",
    "1、文件格式：按照csv格式提交\n",
    "\n",
    "2、文件大小：无要求\n",
    "\n",
    "3、提交次数限制：每支队伍每天最多3次\n",
    "\n",
    "4、文件详细说明：\n",
    "\n",
    "\\1) 编码为UTF-8，第一行为表头，包含id、就诊方向label_i和疾病方向label_j\n",
    "\n",
    "\\2) 提交格式见样例\n",
    "\n",
    "# 赛程规划\n",
    "\n",
    "本赛题实行一轮赛制\n",
    "\n",
    "## 赛程周期 6月9日-7月9日\n",
    "\n",
    "1、截止成绩以团队在比赛时间段内最优成绩为准\n",
    "\n",
    "2、比赛作品提交截止日期为7月9日17:00\n",
    "\n",
    "## 现场答辩\n",
    "\n",
    "1、最终前三名团队将受邀参加科大讯飞全球1024开发者节并于现场进行答辩\n",
    "\n",
    "2、答辩以（10mins陈述+5mins问答）的形式进行\n",
    "\n",
    "3、根据作品成绩和答辩成绩综合评分（作品成绩占比70％，现场答辩分数占比30％）"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43e82628-0e0e-4e14-9451-8b60c535b1ca",
   "metadata": {},
   "source": [
    "# 数据读取"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "4ab9e31c-a981-495c-8b99-5992bd6bae53",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T09:59:17.327415Z",
     "iopub.status.busy": "2022-06-14T09:59:17.326905Z",
     "iopub.status.idle": "2022-06-14T09:59:17.930472Z",
     "shell.execute_reply": "2022-06-14T09:59:17.929959Z",
     "shell.execute_reply.started": "2022-06-14T09:59:17.327356Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Populating the interactive namespace from numpy and matplotlib\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import seaborn as sns\n",
    "\n",
    "%pylab inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "adcba03d-71a7-45bc-a746-700ad3ce2132",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T09:59:17.931690Z",
     "iopub.status.busy": "2022-06-14T09:59:17.931482Z",
     "iopub.status.idle": "2022-06-14T09:59:20.295622Z",
     "shell.execute_reply": "2022-06-14T09:59:20.294704Z",
     "shell.execute_reply.started": "2022-06-14T09:59:17.931673Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "train_df = pd.read_excel('好大夫-非标准化疾病诉求的简单分诊挑战赛2.0公开数据/data_train.xlsx')\n",
    "test_df = pd.read_excel('好大夫-非标准化疾病诉求的简单分诊挑战赛2.0公开数据/data_test.xlsx')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "8cd234df-b6eb-404a-8601-01db90f4ae55",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T09:59:20.298395Z",
     "iopub.status.busy": "2022-06-14T09:59:20.298235Z",
     "iopub.status.idle": "2022-06-14T09:59:20.302622Z",
     "shell.execute_reply": "2022-06-14T09:59:20.302188Z",
     "shell.execute_reply.started": "2022-06-14T09:59:20.298380Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "test_submit = pd.read_csv('好大夫-非标准化疾病诉求的简单分诊挑战赛2.0公开数据/提交示例.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "2d11fb25-6aa5-4811-b85b-a2741a2a209e",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T09:59:20.307436Z",
     "iopub.status.busy": "2022-06-14T09:59:20.307278Z",
     "iopub.status.idle": "2022-06-14T09:59:20.375291Z",
     "shell.execute_reply": "2022-06-14T09:59:20.374850Z",
     "shell.execute_reply.started": "2022-06-14T09:59:20.307421Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>age</th>\n",
       "      <th>diseaseName</th>\n",
       "      <th>conditionDesc</th>\n",
       "      <th>title</th>\n",
       "      <th>hopeHelp</th>\n",
       "      <th>label_i</th>\n",
       "      <th>label_j</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>20+</td>\n",
       "      <td>刨腹产可以养三胎吗？</td>\n",
       "      <td>刨腹产可以养三胎吗？希望医生能帮我看看，谢谢！</td>\n",
       "      <td>刨腹产可以养三胎吗？</td>\n",
       "      <td>前两个都是刨腹产，时间已经隔了有6年了，前两个都很顺利，想问问医生现在可以养三胎吗？</td>\n",
       "      <td>1</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>20+</td>\n",
       "      <td>右膝前交叉韧带断裂，半月板三度重度损伤</td>\n",
       "      <td>右膝受力小腿骨和大腿骨就会滑开，走路会有隐隐作痛</td>\n",
       "      <td>大概做韧带重建手术需要多少费用？</td>\n",
       "      <td>大概做韧带重建手术需要多少费用？</td>\n",
       "      <td>18</td>\n",
       "      <td>57</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>30+</td>\n",
       "      <td>没有不适，就是左眼球有红色充血</td>\n",
       "      <td>昨天我朋友跟我说，我的左眼睛有红色的问我怎么啦，我自己都不知道，所以现在问医生看需要去医院吗...</td>\n",
       "      <td>需要怎么治疗</td>\n",
       "      <td>需要怎么治疗</td>\n",
       "      <td>15</td>\n",
       "      <td>45</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>30+</td>\n",
       "      <td>腿摔了</td>\n",
       "      <td>一个月前从摩托上摔下来了，拍了片子和核磁共振，打的石膏，腿直不了</td>\n",
       "      <td>腿可能这辈子就瘸了</td>\n",
       "      <td>请医生给我一些治疗上的建议，目前病情是否需要手术？，是否需要就诊？就诊前做哪些准备？</td>\n",
       "      <td>18</td>\n",
       "      <td>58</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>30+</td>\n",
       "      <td>体检单问题</td>\n",
       "      <td>偶尔会拉肚子或便秘 胃肠偶尔不舒服 没有疼痛感</td>\n",
       "      <td>体检单问题咨询 需要治疗吗</td>\n",
       "      <td>单位体检遇到的问题</td>\n",
       "      <td>2</td>\n",
       "      <td>-1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   id  age          diseaseName  \\\n",
       "0   0  20+           刨腹产可以养三胎吗？   \n",
       "1   1  20+  右膝前交叉韧带断裂，半月板三度重度损伤   \n",
       "2   2  30+      没有不适，就是左眼球有红色充血   \n",
       "3   3  30+                  腿摔了   \n",
       "4   4  30+                体检单问题   \n",
       "\n",
       "                                       conditionDesc             title  \\\n",
       "0                            刨腹产可以养三胎吗？希望医生能帮我看看，谢谢！        刨腹产可以养三胎吗？   \n",
       "1                           右膝受力小腿骨和大腿骨就会滑开，走路会有隐隐作痛  大概做韧带重建手术需要多少费用？   \n",
       "2  昨天我朋友跟我说，我的左眼睛有红色的问我怎么啦，我自己都不知道，所以现在问医生看需要去医院吗...           需要怎么治疗　   \n",
       "3                   一个月前从摩托上摔下来了，拍了片子和核磁共振，打的石膏，腿直不了         腿可能这辈子就瘸了   \n",
       "4                            偶尔会拉肚子或便秘 胃肠偶尔不舒服 没有疼痛感     体检单问题咨询 需要治疗吗   \n",
       "\n",
       "                                     hopeHelp  label_i  label_j  \n",
       "0  前两个都是刨腹产，时间已经隔了有6年了，前两个都很顺利，想问问医生现在可以养三胎吗？        1        8  \n",
       "1                            大概做韧带重建手术需要多少费用？       18       57  \n",
       "2                                      需要怎么治疗       15       45  \n",
       "3  请医生给我一些治疗上的建议，目前病情是否需要手术？，是否需要就诊？就诊前做哪些准备？       18       58  \n",
       "4                                   单位体检遇到的问题        2       -1  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9f77909-9b49-4451-9e98-0044d9a4cbd5",
   "metadata": {},
   "source": [
    "# 数据分析"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41b49366-9b3a-4856-bac5-61c76b540516",
   "metadata": {},
   "source": [
    "## 类别分布"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "1b46c0e3-5060-4096-8f4b-9ad4673db8a3",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T09:59:20.376203Z",
     "iopub.status.busy": "2022-06-14T09:59:20.375981Z",
     "iopub.status.idle": "2022-06-14T09:59:20.603833Z",
     "shell.execute_reply": "2022-06-14T09:59:20.603368Z",
     "shell.execute_reply.started": "2022-06-14T09:59:20.376188Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<AxesSubplot:>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXAAAAD4CAYAAAD1jb0+AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAVGklEQVR4nO3df5BlZX3n8ffHGUEh8ssBMwK7gy5QsciCZEJBDIbfIYSV1ZgsrNmAmmVXV6OuqwWyFZLdypYas9GtZCVER3RD8AcBYxENsK5KpcqgM4SfAoKKOAgMrAv+qojAd/84Z6BpuvvePufc232d96uqq+8957n3fOeZvt8+/Zzne55UFZKk2fOMlQ5AktSNCVySZpQJXJJmlAlckmaUCVySZtTaaR5s3bp1tWHDhmkeUpJm3pYtWx6sqr3nb59qAt+wYQObN2+e5iElaeYl+eZC2x1CkaQZNfIMPMkm4FRgW1Ud0m47FLgA+CngLuBVVfXdUe910z0Ps+Gcv+kVsLq7652/utIhSBrQOGfgFwEnz9v2AeCcqvpZ4HLgbQPHJUkaYWQCr6prgO/M23wQcE37+Grg1waOS5I0Qtcx8FuA09rHvw7sv1jDJGcn2Zxk82M/fLjj4SRJ83VN4K8BXp9kC/Ac4JHFGlbVhVW1sao2rtll946HkyTN12kaYVXdBpwEkOQgwKtjkjRlnRJ4kn2qaluSZwD/mWZGykg/u+/ubHYmhCQNYuQQSpJLgC8CByfZmuS1wBlJvgrcBnwb+NBkw5QkzTfyDLyqzlhk1/sGjkWStAxWYkrSjDKBS9KM6lpK/4fAv6CZPvg14NVV9dCo97KUXuOw5F8aT9dS+quBQ6rqnwNfBc4dOC5J0gidSumr6qqqerR9+vfAfhOITZK0hCHGwF8DfGaxnZbSS9Jk9ErgSc4DHgUuXqyNpfSSNBmdV+RJchbNxc3jq6rGeY2VmJI0nK6l9CcDbwd+qap+OGxIkqRxdC2l/xOauxBeneT6JGPdC0WSNJyupfQfnEAskqRlsBJTkmZU31koJye5PcmdSc4ZKihJ0mh9ZqGsAf4UOBHYCnw5yaeq6iuLvcZSeu0ovB2ApqHPGfgRwJ1V9fWqegT4KE+ukylJmrA+CXxf4Ftznm9tt0mSpmDiFzEtpZekyeiTwO8B9p/zfL9221NYSi9Jk9H5IibwZeDAJAfQJO7TgX+91AsspZek4XRO4FX1aJI3AFcCa4BNVXXLYJFJkpbU5wycqvo08OmBYpEkLYOVmJI0o0zgkjSjxrkb4aYk25LcPG/7G5PcluSWJO+eXIiSpIWMMwZ+Ec3tYz+yfUOSY2mqLg+tqh8l2Wecg1lKrx2Z5fUaWqdFjYHXAe+sqh+1bbZNIDZJ0hK6joEfBByd5NokX0jy84s1tBJTkiajawJfC+wFHAm8Dfh4kizU0EpMSZqMrgl8K3BZNb4EPA6sGy4sSdIoXQt5PgkcC3wuyUHATsCDo15kKb0kDWdkAm8XNT4GWJdkK3A+sAnY1E4tfAQ4s6pqkoFKkp6q66LGAL85cCySpGWwElOSZpQJXJJmVKdS+iQfS3J9+3VXkusnGqUk6Wk6ldJX1b/a/jjJHwFjVehYSq8dneX0GtI4FzGvSbJhoX1t8c5vAMcNHJckaYS+Y+BHA/dX1R2LNbCUXpImo28CPwO4ZKkGltJL0mR0XlItyVrgFcDPDReOJGlcfdbEPAG4raq2jvsCS+klaTjjTCO8BPgicHCSrUle2+46nRHDJ5KkyelcSl9VZw0ejSRpbFZiStKMMoFL0ozqMwvlWcA1wM7t+1xaVecv9RorMaWFWaGpLvrMQvkRcFxVfT/JM4G/S/KZqvr7gWKTJC2hcwJvF3D4fvv0me2XizpI0pT0GgNPsqa9E+E24OqqunaBNpbSS9IE9ErgVfVYVR0G7AcckeSQBdpYSi9JEzDILJSqegj4HHDyEO8nSRqtzyyUvYEf
V9VDSZ4NnAi8a6nXWEovScPpMwtlPfDhJGtozuQ/XlVXDBOWJGmUPrNQbgRePGAskqRlsBJTkmaUCVySZtTIIZQkm4BTgW1VdUi77TDgAuBZwKPA66vqS6Pey1J6aWmW1Gs5xjkDv4inTw98N/D77Rzw322fS5KmaGQCr6prgO/M3wzs1j7eHfj2wHFJkkboOgvlzcCVSd5D80vgFxZrmORs4GyANbvt3fFwkqT5ul7EfB3wlqraH3gL8MHFGlpKL0mT0TWBnwlc1j7+BHDEMOFIksbVdQjl28AvAZ8HjgPuGOdFltJL0nDGmUZ4CXAMsC7JVuB84N8C70uyFvhH2jFuSdL0dF6VHvi5gWORJC2DlZiSNKNM4JI0o7qW0v8ezTj4A22zd1TVp0e9l6X00ngsqdc4upbSA/xxVR3Wfo1M3pKkYXUtpZckrbA+Y+BvSHJjkk1J9lyskavSS9JkdE3g7wdeCBwG3Av80WINLaWXpMnolMCr6v6qeqyqHgf+HEvpJWnqOpXSJ1lfVfe2T18O3DzO6yyll6ThdC2lP6ZdlaeAu4B/N7kQJUkL6VpKv+jtYyVJ02ElpiTNKBO4JM2orvcDByDJm2hK6gP8eVW9d6n2ltJLy2NJvZbS+Qw8ySE0yfsI4FDg1CT/bKjAJElL6zOE8jPAtVX1w6p6FPgC8IphwpIkjdIngd8MHJ3kuUl2AU4B9p/fyFJ6SZqMzmPgVXVrkncBVwE/AK4HHlug3YXAhQA7rz+wuh5PkvRUqRompyb5b8DWqvqfi7XZuHFjbd68eZDjSdKOIsmWqto4f3vfWSj7VNW2JP+EZvz7yD7vJ0kaX68EDvxVkucCPwb+Q1U91D8kSdI4eiXwqjp6qEAkSctjJaYkzag+hTwHJ7l+ztd3k7x5wNgkSUvoM43wdpoVeUiyBrgHuHyp11hKL3VjSb0WMtQQyvHA16rqmwO9nyRphKES+OnAJQO9lyRpDL0TeJKdgJcBn1hkv6X0kjQBQ5yB/wpwXVXdv9BOV6WXpMnoW8gDcAZjDp+4qLEkDafXGXiSXYETgcuGCUeSNK6+lZg/AJ47UCySpGWwElOSZpQJXJJmVN8x8E1JtiW5eaiAJEnj6TsL5SLgT4CPjNPYUnqpH0vqNVevM/Cqugb4zkCxSJKWwTFwSZpRE0/gltJL0mRMPIFbSi9JkzFEKf3YLKWXpOH0nUZ4CfBF4OAkW5O8dpiwJEmj9C2lP2OoQCRJy+MsFEmaUSZwSZpRIxP4QuXySf5rkhvb1eivSvL8yYYpSZovVbV0g+SlwPeBj1TVIe223arqu+3j3wFeVFX/ftTBdl5/YK0/8729g5Z2ZJbT73iSbKmqjfO3jzwDX6hcfnvybu0KLP1bQJI0uM6zUJL8AfBbwMPAsUu0Oxs4G2DNbnt3PZwkaZ7OFzGr6ryq2h+4GHjDEu2sxJSkCRhiFsrFwK8N8D6SpGXoNISS5MCquqN9ehpw2zivs5RekoYzMoG35fLHAOuSbAXOB05JcjDwOPBNYOQMFEnSsEYm8EXK5T84gVgkSctgJaYkzSgTuCTNqE6l9HP2vTVJJVk3mfAkSYsZZxbKRSyw8nyS/YGTgLvHPZir0ksrz1L8nxydSulbfwy8HcvoJWlFdBoDT3IacE9V3TBGWxc1lqQJWHYhT5JdgHfQDJ+MVFUXAhdCczfC5R5PkrSwLmfgLwQOAG5IchewH3Bdkp8eMjBJ0tKWfQZeVTcB+2x/3ibxjVX14KjXWkovScMZZxqhK89L0irUtZR+7v4Ng0UjSRqblZiSNKNM4JI0o8a5newm4FRg25xFjX8d+D3gZ4AjqmrzOAezElNaeVZi/uQY5wz8IuDkedtuBl4BXDN0QJKk8YxzEfOaJBvmbbsVIMmEwpIkjTLxMXBL6SVpMiaewF2VXpImw1kokjSjOq1K35Wl9JI0nE6l9Ele3q5QfxTwN0munHSgkqSn6lNKf/nAsUiSlsExcEmaUSZw
SZpRvS5itvcC/x7wGPBoVW1cqr2l9NLssxR/9RhiFsqx4yzmIEkalkMokjSj+ibwAq5KsiXJ2Qs1sJRekiaj7xDKL1bVPUn2Aa5OcltVPeUOha5KL0mT0esMvKruab9vo5kXfsQQQUmSRut8Bp5kV+AZVfW99vFJwH9Z6jWW0kvScPoMoTwPuLy9J/ha4C+r6m8HiUqSNFLnBF5VXwcOHTAWSdIyOI1QkmaUCVySZlTXVen3Aj4GbADuAn6jqv7fqPeylF7acVmCP7yuq9KfA3y2qg4EPts+lyRN0cgE3hbmfGfe5tOAD7ePPwz8y2HDkiSN0nUM/HlVdW/7+D6aKYULspRekiaj90XMqiqae6Istt9V6SVpArom8PuTrAdov28bLiRJ0ji6FvJ8CjgTeGf7/a/HeZGl9JI0nE6r0tMk7hOT3AGc0D6XJE1Rn1Xpjx84FknSMliJKUkzygQuSTOq76r0bwF+m2Ya4U3Aq6vqHxdrbym9tGOznH5Ync/Ak+wL/A6wsb1Hyhrg9KECkyQtre8Qylrg2UnWArsA3+4fkiRpHJ0TeLse5nuAu4F7gYer6qr57Syll6TJ6DOEsifNTa0OAJ4P7JrkN+e3s5Rekiajz0XME4BvVNUDAEkuA34B+IvFXmAlpiQNp88Y+N3AkUl2SbOy8fHArcOEJUkapc8Y+LXApcB1NFMInwFcOFBckqQRes0Dr6rzgfMHikWStAxWYkrSjOqdwJOsSfIPSa4YIiBJ0nh6DaG03kRz8XK3UQ0tpZc0n+X13fU6A0+yH/CrwAeGCUeSNK6+QyjvBd4OPN4/FEnScvSpxDwV2FZVW0a0s5Rekiagzxn4S4CXJbkL+ChwXJKnVWFaSi9Jk5Gq6v8myTHAf6qqU5dqt3Hjxtq8eXPv40nSjiTJlqraOH+788AlaUYNMY2Qqvo88Pkh3kuSNB7PwCVpRpnAJWlG9S3k2SPJpUluS3JrkqOGCkyStLS+Y+DvA/62ql6ZZCeadTEXZSm9pL4svX9S5wSeZHfgpcBZAFX1CPDIMGFJkkbpM4RyAPAA8KH2boQfSLLr/EZWYkrSZPRJ4GuBw4H3V9WLgR8A58xvZCWmJE1GnwS+FdjaLq0GzfJqh/cPSZI0js5j4FV1X5JvJTm4qm6nWdT4K0u9xlXpJWk4fWehvBG4uJ2B8nXg1f1DkiSNo++ixtcDT7vBiiRp8qzElKQZZQKXpBnVZ0We/ZN8LslXktyS5E1DBiZJWlqfMfBHgbdW1XVJngNsSXJ1VS06E8VSekk7okmV/3c+A6+qe6vquvbx94BbgX2HCkyStLRBxsCTbABeDFy7wD5L6SVpAnon8CQ/BfwV8Oaq+u78/ZbSS9Jk9L0f+DNpkvfFVXXZMCFJksbR53ayAT4I3FpV/32c11hKL0nD6XMG/hLg3wDHJbm+/TploLgkSSP0uZnV3wEZMBZJ0jKkqqZ3sOR7wO1TO+D41gEPrnQQ86zGmGB1xrUaYwLjWo7VGBOsnrj+aVXtPX9j37sRLtftVbXqbn6VZPNqi2s1xgSrM67VGBMY13Ksxphg9ca1nfdCkaQZZQKXpBk17QR+4ZSPN67VGNdqjAlWZ1yrMSYwruVYjTHB6o0LmPJFTEnScBxCkaQZZQKXpBk1lQSe5OQktye5M8k50zjmnGMvuPBEkr2SXJ3kjvb7nu32JPkfbaw3Jjl8grGtSfIPSa5onx+Q5Nr22B9rF4smyc7t8zvb/RsmGNMeSS5NcluSW5MctUr66i3t/9/NSS5J8qyV6K8km5JsS3LznG3L7p8kZ7bt70hy5gRi+sP2//DGJJcn2WPOvnPbmG5P8stztg/6OV0orjn73pqkkqxrn69YX7Xb39j21y1J3j1n+1T6qrOqmugXsAb4GvACYCfgBuBFkz7unOOvBw5vHz8H+CrwIuDdwDnt9nOAd7WPTwE+Q1NleiRw7QRj+4/AXwJXtM8/DpzePr4AeF37+PXABe3j04GP
TTCmDwO/3T7eCdhjpfuK5j7z3wCePaefzlqJ/gJeChwO3Dxn27L6B9gL+Hr7fc/28Z4Dx3QSsLZ9/K45Mb2o/QzuDBzQfjbXTOJzulBc7fb9gSuBbwLrVkFfHQv8b2Dn9vk+0+6rzv+eiR8AjgKunPP8XODclfjHtsf/a+BEmorQ9e229TRFRgB/Bpwxp/0T7QaOYz/gs8BxwBXtD+6Dcz50T/Rb+8N+VPt4bdsuE4hpd5pEmXnbV7qv9gW+1X6I17b99csr1V/AhnkJYFn9A5wB/Nmc7U9pN0RM8/a9nOaOoU/7/G3vq0l9TheKC7gUOBS4iycT+Ir1Fc2JwAkLtJtqX3X5msYQyvYP33ZbWaGVe/LUhSeeV1X3trvuA57XPp5WvO8F3g483j5/LvBQVT26wHGfiKnd/3DbfmgHAA8AH2qHdj6QZFdWuK+q6h7gPcDdwL00//4trHx/bbfc/pn2Z+I1NGe3Kx5TktOAe6rqhnm7VjKug4Cj2+G2LyT5+VUQ01h2mIuYWWLhiWp+jU5tPmWSU4FtVbVlWscc01qaPy/fX1UvBn5AMyTwhGn3FUA7pnwazS+Y5wO7AidPM4ZxrUT/LCXJeTTr1168CmLZBXgH8LsrHcs8a2n+ujsSeBvw8SQzcaO+aSTwe2jGvLbbr902NVl44Yn7k6xv968HtrXbpxHvS4CXJbkL+CjNMMr7gD2SbL8/zdzjPhFTu3934P8OHBM0ZxJbq2r70niX0iT0lewrgBOAb1TVA1X1Y+Aymj5c6f7abrn9M5V+S3IWcCrwqvYXy0rH9EKaX8I3tD/7+wHXJfnpFY5rK3BZNb5E81fxuhWOaSzTSOBfBg5sZwzsRHNR6VNTOC6w5MITnwK2X9E+k2ZsfPv232qvih8JPDznz+NBVNW5VbVfVW2g6Y//U1WvAj4HvHKRmLbH+sq2/eBneVV1H/CtJAe3m44HvsIK9lXrbuDIJLu0/5/b41rR/ppjuf1zJXBSkj3bvy5OarcNJsnJNEN0L6uqH86L9fQ0M3UOAA4EvsQUPqdVdVNV7VNVG9qf/a00EwzuYwX7CvgkzYVMkhxEc2HyQVawr8Y2jYF2mivMX6W5cnveNAf5gV+k+ZP2RuD69usUmjHRzwJ30FyB3qttH+BP21hvAjZOOL5jeHIWygtofkDuBD7Bk1fFn9U+v7Pd/4IJxnMYsLntr0/SXPlf8b4Cfh+4DbgZ+F80MwOm3l/AJTTj8D+mSUCv7dI/NOPSd7Zfr55ATHfSjNNu/5m/YE7789qYbgd+Zc72QT+nC8U1b/9dPHkRcyX7aifgL9qfreuA46bdV12/LKWXpBm1w1zElKSfNCZwSZpRJnBJmlEmcEmaUSZwSZpRJnBJmlEmcEmaUf8fa+Y1605WJQ0AAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "train_df['label_i'].value_counts().plot(kind='barh')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "d8f201f7-9a52-4856-a133-6642ce4f84ab",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T09:59:20.604711Z",
     "iopub.status.busy": "2022-06-14T09:59:20.604534Z",
     "iopub.status.idle": "2022-06-14T09:59:21.134881Z",
     "shell.execute_reply": "2022-06-14T09:59:21.134250Z",
     "shell.execute_reply.started": "2022-06-14T09:59:20.604695Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<AxesSubplot:>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXAAAAD7CAYAAABzGc+QAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAgpklEQVR4nO3debyVZfX38c8CmWUUxCODgAMOPIpGhjmkOeRATjll+phppv1s0Mwwy0fzVw4/s7QsJc3UjFTENHLA2ccGB1RQARUZFBWBQISQSdbvj3Ud2Wz2cO+z9z548Pt+vc6Lve9939fe535tLm7Wva61zN0REZGWp9X6/gAiItI0msBFRFooTeAiIi2UJnARkRZKE7iISAulCVxEpIUqO4Gb2e/NbK6ZvZS3/VtmNtXMXjazy+v3EUVEpJAsV+B/AA7M3WBm+wCHATu5+w7AFbX/aCIiUspG5XZw9yfMbEDe5jOAS919edpnbpY369mzpw8YkD+UiIiUMmHChPnu3it/e9kJvIhtgD3N7KfAMuAcd3+m3EGL23Tn2WefbeJbioh8MpnZrELbm3oTcyOgF2BAA/B3M7uoyBufZmbPmtmzHy5d1MS3ExGRfE2dwGcDdwCfd/dtgDeBEWY2PH9Hdx/l7sPcfdjQrftX8VFFRCRXUyfwvwD7uPsSM9sGaEdcjasylohIM8mSRjga+Ccw2Mxmm9kpwO+BQWY2E5gKbA687+5PlRrrxbcWMWDk36r/1CIigjW1nKyZtQZeBfYHFgMzgOPd/Z68/U4DTgPo37//p2bNKhiLFxGRIsxsgrsPy99ezUrMXYFp7j7d3ecRV+mn5++UGwPv1WudLBgREWmiaibw7YB3AcysA9AfWF3qAIVQRERqJ0sMvL2ZPW1mE9Oy+cZ0wdOBr5jZB8ACIhNlZoHjlUYoIlIHWa7AlxPpgjsBQ4EDU7rgPOBFd+/g7h2Ah4G38g/OTyOceekhNfz4IiKfXGUncA9L0tM26ceB+UCDmQ00s7bAccA9RYYREZEayxJC6Wdmj5rZMiLbZEFKF+wP9ACmA/8G7nT3l0uNpRi4iEjtZKmFsgr4nrs/Z2Z9gVfN7FDgAmAuMIqY2AvKTSNs3UVZKCIitVJxHriZTQGecPdvpOePAaOBL7r7iFLHDhs2zFXMSkSkMk3OAzezXmbWLT0eDAwEHjKzhpzdPge8VODwtSiEIiJSO1lCKEOBsWbWBmgLPO/ud5jZjBRSaQUsAU4udLBCKCIi9VE2hJKutPsBFwOPERP14cAxxMQ9gqgHXjY2ohCKiEjlqllKPwc4E5ji7pcAU4A+Nf58IiJSoSwhlN2BE4EXzWwysBVwM7AtcRVuwNNmNhcY4O7Lig2UGwPXgh4RkepkWcjzpLsb8FngA+A4d78TuASYBXRKjz8gFvOsRUvpRUTqI1MaYbqBOQ54wN2vTNv6AP8CdgK6A88DR7n7+GLjKAYuIlK5YjHwsiEUMzPgBiIGfmXOS6uBK4A30vP5pSZvUAhFRKSWstzE3IeIgX/DzD4wszlmdjBx9f0L4h+BBcBUMzsh/2CFUERE6iNLGqEBnVL/yzbAk8B3gEOB3sCpwJ+IqoVL3f2bxcZSCEVEpHJNTiMsUY3wbmA40AF4GtiFSDEUEZFmkCWNsLH/5QQihfCaxubFZvZXoitPJ2ApMLHUOIqBi4jUTqaWau7+obsPBfoCu5rZkPRSA5F9cjVRWnZS/rGKgYuI1EdTqhFeQFxt/45oofY4cKS7l+yHCYqBi4g0Ra2qEXYA9gemAmcTIZjFwAQzu97MOpUaqzGEooqEIiLVyxIDbwBuSnHwVsDt7j7OzP4CtAa+DKwElgEjgR/nHqxqhCIi9VFxCOWjA802I+LfhxC1Uc4ARrp70buTCqGIiFSummqEBbn7HGAa0C1t2heY3NTxRESkMlnTCGcSse4PgVXuPszMdgK6AH8jGj0MI1ZsFqU0QhGR2sk0
gSf7uPv8nOfXA98mKhI+SXTqWZh/kGLgIiL1kbUa4UxgWO4EbmaLiPDJFsD9wGp3377UOIqBi4hUrsnVCBMHxpuZA9e5+yjgZeAw4AWgK7BxuUHymxorjCIi0nRZ8sDbA+8RKYOdgJ+Y2V7Aw8AYYAawKdDRzE4pcLxWYoqI1EFTqhHOAMYCNwILiUbHxwK/cvddS42lEIqISOWqCaF0JPpeQoRKugGvA2+5+9yY3zkbuLY2H1VERLLIMoH3A55LoRQDZrn7VWb2emqr1g7omf78famBFAMXEamdLAt5XgE2dfdWQC+gh5kd7+5bunt7Io3wUSKssg7FwEVE6qOipfRm1pFYfTnG3b+dts0iQiv93f39UscrBi4iUrlqmhr3IlZgPkI0dFgI5DYv7gg8UW7yBoVQRERqKUsIZQvgTWAwMeFvDMw0s1+Z2Qoi/v0FM7uh0MEKoYiI1EdT0wjvAg4HjgLuJTJQprt7wUm8kUIoIiKVq6YaYU/WhFo6E2mEM4AVwM7AOGIp/Zdq8klFRCSTLFfgw4i2aR2INMI3gAFEO7W2RHy8I7DM3TcvNVa7hq294aRffvRcMXARkfKquQKfAPTOSSPsTnThuSUd/zbwG2BBkTdWDFxEpA7KZqF4XKIvSU+XpsfD08/u7j7NzA4Atily/ChgFKQYuK66RURqopo0whOAY83sSCJT5axyY+WnEYLCKCIiTZUlhDIKeBfYloh/rwRGEDczL0zbHyQ1bcinEIqISH1kuYm5FxE2udndh5jZBcQV+YnAQcAXge2BL7t711JjKY1QRKRy1dzEnAKsSoN0APYHJgJ/AfYhaoQ3AK/W6sOKiEh5WaoRjiKuslsBzwC3u/s4M9sbuC6N8QLw1XIDKQYuIlI7WSbwXxANjO909yE5268C/g+wK/Cgu08sdLCaGouI1EfWpsZ7AA+l8rGN28YAFxMrMZe4+3blxlEMXESkctU2Nc4f7HSiI8/EVGL2+SzHKYQiIlI7WfLA7yayTdqY2UqikNV+wDwzW0zcxOxpZt3dfWGB4xVCERGpgyxphA3ALsBlwG7Ai0RRq42A1enxEuCP7n5mqbEUQhERqVyT0wjd/R3g5fR4MZFCeBwwh8j/ng3sTqQXiohIM8lyBT4aOBpoTcoHBzYB5gOTiUl8DtDH3VuXGiu/GiEoBi4iUk41V+BfJq6yBwCTgGNT+7Sl7j7U3dsCdxD1wQu9sZbSi4jUQdY0wpnAdGCcu1+Ztr0C7E1cfc8Glrv7oFLjKAYuIlK5apoaGxEy6Qh0NbMlqUTsPcBJwD8AB8aUG0tphCIitZMlD/wIopFxG2AH4Ndm1g34a/rZGPgAuKbQwUojFBGpj6xphA3u/pyZdSZaqd1AFLL6ATAauBzo5u4/LjWWQigiIpWrphrh+8Br6fFqoi/mQqIDT1tgKnA7amosItKsslyBDyJWXwIMJtIJNwEeJTr0LAfeArZy986lxlIaoYhI5apJI5zu7jsRoZKlwMSURjgXmAW8SeSHW5E3VhqhiEgdZE0jHEB0p78N6E/UB58HbObuq8zsGOB36sgjIlJ71aYRPg48QMS6zyFCKIvT5N0KOAZYXG4spRGKiNROljTCp4mr7t5E84ZeRA54PzNbRsTE3yMaH69DaYQiIvWRZQJ/hZjAu7MmH3wGkYWyGVGlsDOwRaGD06KfUZBCKLriFhGpiSw3MU8APkM0LT4OeMTdv0JkoRxFhE86A3fX8XOKiEierB15ngR6Er0xe6ZtbwO3ECGUg4gslZIUAxcRqZ1KWqrNyOt7uZDoUn+nu/+82EGKgYuI1EeTemImrYiu9MeU2kkxcBGR+sg6gTswwMwmANelSXkQcUPzXjN7FvheoZ6YuRRCERGpnbI3Mc3sNiLbpD2wI/A/ZrYX0R9zftptN+CqIsdrJaaISB1kqYViQCd3X2JmjSmEY4EfpSX1mNn1wAh336zUWFqJKSJSuWqqEXZkTZ2TrkA34HWgUxrY
gO2IzjwiItJMssTAewN3mVlj+di/u/tVZnaLmR1C5IA3LrcvSTFwEZHayVyN0N07EDVQPjSzIe5+orv3AEYS/TI3KXS8YuAiIvWRqRrhWgeYXUB0pL/CzPoCNxGt1c5z996ljlUMXESkctVUI+wFrHT398ysA7A/cLmZbQVcCpybfpaUG6tQCAUURhERaYosNzFHAfNS5cFngAeB2cAk4HBimf0A1rRdW4tCKCIi9ZEljXAv4ur6ZncfkraNJ5bS704s5ulOdOUZm4pfFaQQiohI5appqfYEsCB/MzFZ9wW+CzxBVCksOnmLiEhtVVSN0MxeIK60TwQeMLNriRrhc4jmxiUpBi4iUjtZYuCNZrj70HQZfwZRWnYCcBLR9OGzhQ5SDFxEpD6yNjWeTfTA3C49X0T0yBwFPAwscvcu5cZRDFxEpHJNTiNM1qpGSDRz2AXYE/hVjG+fdvdnSg2iEIqISO1kqUb4OtCHqEbYAPwYuIsoJ3s+UanwHOD2VBcl/3iFUERE6qDiNEIzuxA4hbgKHwkMTD/HA8PdfV6xsRRCERGpXDXVCCeQMkzMrBNwANADuBHYh1jY82Wi0NX8ImOIiEiNZYmB9wOeBdoB/wb+AUwkJus9gKlEydl9vczlvGLgIiK1k+UK/BXgU8BkonRsJ6L7zmlp+zQAd3+k0MGKgYuI1EfZK3B3dzNbmp62ST8zgYOAh4AfESsxix2vpsYiInWQNY2wFbAlMBe4hmirdiZRRvYbwMosgyiEIiJSO1nKyd4GHEFcea8CvgZ0ICZuBxYBbcxsK3efVuD404hwC6279KrdJxcR+YRralPj+4CTgXeJq+/+wJJyqzGVRigiUrlq0gh7suZKvTPR1HgK0dj4SHcfQEziv67JJxURkUyyxMAbgJvMbFtiNeYCd7/SzPoB/zCzVkQo5dJyAykGLiJSO1nqgU9y952JZfNjADezIcCOwJeAWcBFwJWFjlcaoYhIfWStRtjYvPinwG+JUrKnu/uWZjYTOAwY7e7blxpHMXARkcpV29T4GqJx8SZETHwK0NXMtkm77Z22lVQshAIKo4iIVCpLDPxYolnDjcBgokP9ODO7mlhS3w64mLgKX4fSCEVE6iNLGuElRAu1jkT+d1tgNLArMWnfR1yh7+DuXy01lkIoIiKVq6ap8XnAcOB54IfA/NS82IHGvO8uRHlZERFpJlli4O2BSUT1wXOJYlYQVQn/lR6fD9xabizFwEVEaifLQp79gDHuvg1wArDSzIYD+wM/dncDniI69KxDaYQiIvVRSQx8FbGQZ1Pgn0RYpZ27rzKzw4E/uvvGpcZSDFxEpHJNTiN09/PM7EdEZ57BwHTgSCLmPQh4FdgK+LDcWAqhiIjUTtZysq8Di4F3gC2ArYF5wAQzawt8QJSaXYfSCEVE6iPrSsyZwDB3n29mFxAT9kXAf7v7z8zsN8CB7l4wDt5IIRQRkco1OY0wrcRslR53IG5eTiFCJgvSbhuzJjtFRESaQZabmDsSKy4bd1zk7t3NbEl63gb4D9DG3TuXGqtdw9becNIvC76mGLiISGHVLOSZBMwmsk82A94ws72IJfRHu3s7Ir2wfZE3VhqhiEgdZL2J+SGAu881s7uIZfQriE48EDc2Pyh0oJoai4jUR5aVmI2x7fGpvVoX4L+APwOXpRTDLsCfyo2lNEIRkdrJcgU+FOhFhE8MWAYsBboSha02Ja7QP13oYKURiojUR5abmA1Ag7s/Z2adgZnADcBZwKHufp+ZHQzc4+4l/0FQGqGISOWqaWr8PvBaeryauApfCCwnltND1At/vwafU0REMspyBT4IuCs9HQy0Jjrz3A18LmfXo939zlJjlUojBMXBRUQKqSaNcLq770Q0cVgKTHT394FXiEm7FXAvRbrSK41QRKQ+si6lH0AUs7oN6O/uI8xsEdAt7fIb4GR3L5gL3kgxcBGRylXT1NiAx4EHgNuBc9JLbxNX3rsQRa4mlxurVBoh
KIQiIlKJLDcxv08s2DmCmMQPSlknX2dNCuFOwEOFDlYIRUSkPrLkgXcnJul5RN53F+B4dz8hNXK4nviHYKdCB2slpohIfWRtajybCJUcBzwCnGhmWwG/IPpkdgCm1fFziohInqw3MVcSLdWMyAXfGHiXuDpfSVQk/Ky7P11qHKURiohUrpqFPABzgH5ELPw1YA9gFrCJu3cAFgHfLPLGioGLiNRBUzryXEjExL9F5IVDTO5Lga3dfU6xcZRGKCJSuWo68uRWI3ye6FD/jLtvCtxKrMz8kJjgi07eIiJSW1myUP4A9AV6E2GTjsCuZvZzYDuiNviDwCnEDc215FcjLJUHDoqDi4hklaUWyl7AEuBmdx+SQigdiYqE1xELe+YC97r7kFJjKYQiIlK5am5iTiAqDzaGUw4gUglX5+xzGDC1Bp9TREQyyhJC6Q3cAQwE5gOz3P1+M3sC2B34O3EFvnu5gcotpW+kMIqISHmZqhECBwPvEWVlp5lZK2JCf5aYuG8E9i90vNIIRUTqI2tT482AzsSy+bOJeuArWNPI+EHgPCIuvhYtpRcRqY+sC3kuIBbzNMa95xOTf+f0/CgiF1xERJpJlnKyc4g4OEQs/G2iuUNfYmXmM8Tk/m6R4ytKIwTFwEVEssiSRvhHIr7dHVhAVCOcRqQVXpH2OQA41d2PKTWW0ghFRCpXTUu1E4DPAK+yphrhWKKgFWbWDvgBcG0tP7CIiJSW9SZmIWeZ2fnEPwLzgB2Jyb0opRGKiNRORRO4uz8GPGZmvYmr8D8BfyM684wws3HuvlZd8PwYuIiI1EYlTY3H5S6VN7OjgQOBi4FxRMPj5e5+ebFxFAMXEalctfXA8wdrAF4C9gS+AkwhFvsolVBEpJlkSSMcDewN9DSzFcRkPQk4kujEcxFRTvbt9Gf+8UojFBGpg0whFAAzOxsYBnRx9xF5r91JTOb3u/tvio2hEIqISOWqCqGYWV/gEGIpfe72Tc2sC7AvsC1xU1NERJpB1iyUXxLNGjrnbb8TGJQen+Hu75UaRGmEIiK1k6Wl2ghgrrtPKPByP2Jl5kLgsiLHqxqhiEgdZFlKfwnRB3MV0J6YsMe6+wlm9gaxInNzd19W7s0UAxcRqVw1S+nPc/e+7j6AtJQ+La8H6ASMzzJ5i4hIbWWKgZvZTGAx0AHombb9D1Hg6ggzWwhc6O5XFTi24jTCRoqFi4gUl3Ul5kxgmLvPz9l2ADDV3d8ws18BxwJHufsTxcZRCEVEpHI1XYkJ4O7j3f2N9PQRoib4rk0dT0REKpM1jdCB8WbmwHXuPip1qG/l7ouBrxM3OF8qNUjWNMJGCqGIiBSX9Qp8D+DTQFvgCjPbi1i0M8/MVgL7Abe4+/35ByqNUESkPpqylH4Y0aR4MlFG9hvAm8Cj7v7bUmMoBi4iUrkmx8DNrJOZDSaW0v+RyEJ5iWhwfC5wKPBPokemiIg0kywx8D8AXwJWEsvm5xK1T76TXp9GhFYeKHSw0ghFROojy0rMHwCfArYHzgTOAa4mFvSsMrOJwAp3/3S5N1MIRUSkcsVCKFmuwLsRjRs2Af5MLKV/z93Hm9n/A5YRV+EiItKMyk7g7n6emV1HtE07Ezgn1UE5FfgCMB+4L8ubKY1QRKR2suaBP0ncvLw+/QlwHZEf3hrYyswGuftP8g9UU2MRkfrIupR+NrDY3bfL2XYecAxRI+Vsdy8b3FYMXESkcrVuanwgUWL2ACKdUEREmlmWpsZ3A33SYweeAXoAmwNvECmEj5vZa+4+tMDxTU4jBMXBRUSKyZJG2ADs6O4PmNlAYCpwqrvfkl5/DJgNvFooBp5LIRQRkcpV09DhHXd/ID2eAcwgaqPk2g8YXYsPKiIi2WQJoXxUddDMtgMGAj/M2aUrsMDdXys3VqVphI0URhERWVeWm5hDgXfMbBnwMvCiu481s9vNbEV6fZsUSlmHqhGKiNRH1hh4P+Bi4DHgZOBwIu69FHgL+DXQx91P
LzWWYuAiIpWrJo1wDrECc4q7XwJMISbr94nY91SiY322urQiIlITWVZiHkHkfC83s9PTMTen124ANgM+A/yu0MHVphGCYuAiIoVkDaE0AK8C/x/oTVx59wbOJ+qEnw30cPfvlxpLIRQRkcpVlUYIvAjcCdxCLOTpA5wBXOruy4FbgYNq+olFRKSkLGmERoRKpgBjge8CTwFXAXua2U+JlZlvFBujUVPTCEFhFBGRfBXFwIFvAwuJhTxbEiGUlcB/gA5mZp4Xk1E1QhGR+qgmjfAmYKS7P2xmlxGT9DbuPq/YWIqBi4hUruZphERY5XNpn1lEUav5tfm4IiJSTpYQyu5ECOVFM5sMbEWkEY4H/mFm5wJtgIfzwydQmzRCUAxcRCRfpoYOAGa2MfA48NO0lL4xvfAgYDgwGDjc3ScXG0MhFBGRylXV0MHM2hBphLe6+1j4KL1wR2AEcCxrQisiItIMKkojdPcrc7YfCJxLxME3BXYm0guLqiaNEBRGERHJlSUG/lditeVyM9s7bfsZ8Aci9j09jfNEqo+yFqURiojUR5Y0wr2AJcDN7j4kbbscWAD8nCgx+6a771vuzRQDFxGpXDVL6Z8gJutchxF54DcQueF9a/AZRUSkAllCKB8xs+eJ+t+9iZWYJwIrgLZmNhs4zd3vzTumJmmEjRQHFxEJmdIIzWwA8E/gUaALsZT+LGAf4KvAv4HB7j631DgKoYiIVK6qNEKi5ndn4Pr0/F2iLspPiKvxueUmbxERqa2sIZQLiCX1q9Pze4BvEvnfZwArzGzrco2Nq00jbKQwiohIhitwM3ucyPXeAngQ2BO4FGhHlJZdTazI/H2R49XUWESkDrKkEV5C3KzsCHQgilaNBoYRC3mOJkrOfujuXUuNpRi4iEjlqkkjPI+odfI88ENgvrufANwNXEZM4q2JlmsiItJMssbAryG6zo8ENjGz3YBlRJ3wmcRV+VnlBlEMXESkdrLUQhlBVBq8Avhv4PtEV55TgJ+5+8/MbDnwReDJAsd/lAfev39/Tb4iIjWSJQZ+JZEyOBtoT+SBLyfSCt8BPgT6A6vcvW2psRQDFxGpXLEYeJYQys3Ewp3JRHOHFcB2wFvu3i8NvgRYVW4ghVBERGonywTeh8g4aSwl2ImIhQNgZt9L2wrmCKoaoYhIfWQJoewI3O/um5tZZ+JKfDqxOvN4orTsDsAyd9+q1FgKoYiIVK6aNMJJwOtmNtjdFwP/IZbS30PUBD+XuAK/r6afWERESsqaRvgt4FYzG0I0cTgAuIWIhT+dto0vN0itYuCNFAsXkU+yTBO4u79gZmOAQUT4ZD6wPfA1d7/ZzGYS1QrXoTRCEZH6yNrUeACR//1noqjVlsDGwBVp8u4LPGdmm+Uf6+6j3H2Yuw/r1Us3MUVEaiVrU+PHgQeA24Fz3P1FM7sN2I3ICf8PMNzd55QaSyEUEZHayXIF/n1ioc4Qoh74nmZ2cHr8AbHEfmNgupl9N/9gVSMUEamPSqoRrmLNSsyxqaBV4z6fB+4luvLMKjaW0ghFRCpXVTVCd+/r7gOA44BH3P0EM2tIAxtwJlGlsOjkLSIitVVRU+NGZrYt8FRa2PMuEQO/stxxtY6B51NMXEQ+STI1NV7nILNNiQ49hwPvE3HyHdz93QL75qYRfmrWLF2ki4hUotqmxmtx97nu/gywEtgWeK7Q5J32VRqhiEgdNCmEkmcocHWWHesdQhER+TiqV3i3SVfgOdoA2wBji+2gNEIRkfrIHAM3s/8Cvp6eHuzub5vZhcASd78iyxhKIxQRqVyxGHiTbmLmDHohFUzgZrYYeKXJb/jJ0JOoNSPF6RyVp3NUXks6R1u4+zo3EZuahbIZ8CyxqGc1sATY3t3fL3Pcs4X+FZE1dI7K0zkqT+eovA3hHDXpJmaqedK3xp9FREQqUO1NTBERWU+aewIf1czv1xLpHJWnc1SezlF5Lf4cVXUTU0RE1h+FUEREWqhmmcDN7EAz
e8XMppnZyOZ4z48LM+tnZo+a2WQze9nMvpO29zCzB83stfRn97TdzOzqdK4mmdkuOWOdlPZ/zcxOWl+/U72YWWsze97MxqXnA83sqXQubjOztml7u/R8Wnp9QM4Y56Xtr5jZF9bTr1IXZtbNzMaY2VQzm2Jmu+l7tDYzOyv9PXvJzEabWfsN+nvk7nX9AVoDrxP9NNsCE4mUw7q/98fhB2gAdkmPOwOvEv1ELwdGpu0jgcvS44OB+wADhgNPpe09iH6kPYDu6XH39f371fhcnQ38CRiXnt8OHJceXwuckR5/E7g2PT4OuC093j59v9oBA9P3rvX6/r1qeH5uAk5Nj9sC3fQ9Wuv89AFmAB1yvj9f3ZC/R81xBb4rMM3dp7v7CqKv5mHN8L4fC+7+jrs/lx4vBqYQX7TDiL+QpD8PT48PA2728C+gW6q9/gXgQXdf4O4LgQeBA5vvN6kvM+sLHEJ0emqsM/95YEzaJf8cNZ67McC+af/DgD+7+3J3nwFMI75/LZ6ZdQX2Am4AcPcV7v4e+h7l2wjoYGYbAR2Bd9iAv0fNMYH3Ad7MeT47bfvESf9F2xl4Cujt7u+kl+YAvdPjYudrQz+PvwTOJRaGAWwCvOfuq9Lz3N/3o3ORXl+U9t+Qz9FAYB5wYwozXW9mndD36CPu/hZwBfAGMXEvAiawAX+PdBOzmZjZxsCdwHc9b8Wqx//bPrHpQGY2Apjr7hPW92f5GNsI2AX4rbvvTDRRWet+kr5H1p24eh4IbA50YsP638U6mmMCfwvol/O8b9r2iWFmbYjJ+1Z3b6zc+G5OW7oGYG7aXux8bcjncXfgUDObSYTYPg9cRfy3v3G1cO7v+9G5SK93Bf7Nhn2OZgOz3f2p9HwMMaHre7TGfsAMd5/n7iuJKqm7swF/j5pjAn8G2DrdCW5L3Cy4pxne92MhxdRuAKa4e27buXuAxgyAk4C7c7b/35RFMBxYlP6L/ABwgJl1T1caB6RtLZ4X7rv6FeBR4Ki0W/45ajx3R6X9PW0/LmUXDAS2Bp5upl+jrjzKV7xpZoPTpn2Byeh7lOsNYLiZdUx/7xrP0Yb7PWqmu8MHE9kXrwPnr+87t835A+xB/Ld2EvBC+jmYiLU9DLwGPAT0SPsbcE06Vy8Cw3LG+hpxQ2UacPL6/t3qdL72Zk0WyiDiL8404A6gXdrePj2fll4flHP8+encvQIctL5/nxqfm6FEEblJwF+ILBJ9j9Y+RxcBU4GXgFuITJIN9nuklZgiIi2UbmKKiLRQmsBFRFooTeAiIi2UJnARkRZKE7iISAulCVxEpIXSBC4i0kJpAhcRaaH+F7wO1o1yOBv1AAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "train_df['label_j'].value_counts().plot(kind='barh')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b1e7e0e-a8f9-4104-aea2-4cf596e3b869",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T08:12:29.252779Z",
     "iopub.status.busy": "2022-06-14T08:12:29.252169Z",
     "iopub.status.idle": "2022-06-14T08:12:29.269084Z",
     "shell.execute_reply": "2022-06-14T08:12:29.268165Z",
     "shell.execute_reply.started": "2022-06-14T08:12:29.252728Z"
    },
    "tags": []
   },
   "source": [
    "## Text length"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "a3a10bcf-919d-455e-a1ef-1dbed1a12a3a",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T09:59:21.135943Z",
     "iopub.status.busy": "2022-06-14T09:59:21.135714Z",
     "iopub.status.idle": "2022-06-14T09:59:21.527900Z",
     "shell.execute_reply": "2022-06-14T09:59:21.527392Z",
     "shell.execute_reply.started": "2022-06-14T09:59:21.135926Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<seaborn.axisgrid.FacetGrid at 0x7fc9f1526cc0>"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWAAAAFgCAYAAACFYaNMAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAZ2ElEQVR4nO3df5BldXnn8ffH+YEjg0yzTqZYIAHNxA1mI5IJsmqyKokCG4MaQrAMoIuiEVyMSSyMW6sVy1qNoltGHTMGitFSWIwSRnciIhBZa0UYCQ6/RCYICoUwMnNbQVBmePaPexovY/dMM/Tt7+3p96vqVp/7nB/3mdM9nz597jnfm6pCkjT7ntS6AUmarwxgSWrEAJakRgxgSWrEAJakRha2bmAYjj766PrSl77Uug1JmpDJinvkEfAPf/jD1i1I0i7tkQEsSXOBASxJjRjAktSIASxJjRjAktTI0AI4yUFJrkhyU5Ibk5zZ1d+V5K4k13WPYwfWeXuSTUluSfLSgfrRXW1TkrOG1bMkzaZhXge8DfiLqro2yT7AN5Nc2s37UFV9YHDhJIcCJwLPAv498JUkv9bN/ijw+8CdwDVJ1lXVTUPsXZKGbmgBXFV3A3d30z9OcjNwwE5WOQ64oKp+Cnw3ySbgiG7epqq6DSDJBd2yBrCkOW1WzgEnORh4DvCNrnRGko1Jzk0y1tUOAL4/sNqdXW2q+o6vcVqSDUk2bN68eab/CZI044YewEmWAp8D3lJVPwJWA88ADqN/hHz2TLxOVa2pqlVVtWr58uUzsUlJGqqhjgWRZBH98P10VX0eoKruGZj/CeCL3dO7gIMGVj+wq7GTuiTNWcO8CiLAOcDNVfXBgfr+A4u9Arihm14HnJhkrySHACuBq4FrgJVJDkmymP4bdeuG1bckzZZhHgE/HzgJuD7JdV3tr4FXJTkMKOB24A0AVXVjkgvpv7m2DTi9qrYDJDkDuARYAJxbVTcOsW+qil6vB8CyZcvo/y6RpJmVPfFDOVetWlUbNmzY7fW3bt3KSasvB+BTf/ZixsbGdrGGJO3UpEdxe+R4wDNh8ZKlrVuQtIfzVmRJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJasQAlqRGDGBJamRoAZzkoCRXJLkpyY1Jzuzq+yW5NMmt3dexrp4kH06yKcnGJIcPbOuUbvlbk5wyrJ4laTYN8wh4G/AXVXUocCRwepJDgbOAy6pqJXBZ9xzgGGBl9zgNWA39wAbeCTwXOAJ450RoS9JcNrQArqq7q+rabvrHwM3AAcBxwNpusbXAy7vp44BPVt9VwLIk+wMvBS6tqi1VtRW4FDh6WH1L0myZlXPASQ4GngN8A1hRVXd3s34ArOimDwC+P7DanV1tqvqOr3Fakg1JNmzevHlm/wGSNARDD+AkS4HPAW+pqh8NzquqAmomXqeq1lTVqqpatXz58pnYpCQN1VADOMki+uH76ar6fFe+pzu1QPf13q5+F3DQwOoHdrWp6pI0pw3zKogA5wA3V9UHB2atAyauZDgFuHigfnJ3NcSRwHh3quIS4CVJxro3317S1SRpTls4xG0/HzgJuD7JdV3tr4H3AhcmORW4Azihm7ceOBbYBPwEeC1AVW1J8m7gmm65v6mqLUPsW5JmxdACuKq+BmSK2UdNsnwBp0+xrXOBc2euO0lqzzvhJKkRA1iSGjGAJakRA1iSGjGAJakRA1iSGjGAJakRA1iSGjGA
JakRA7ixqmLr1q30bwSUNJ8YwI31ej1OPPtier1e61YkzTIDeAQsWrK0dQuSGjCAJakRA1iSGjGAJakRA1iSGjGAJakRA1iSGjGAJakRA1iSGjGAJakRA1iSGjGAJakRA3gXHK1M0rAYwLswPj7uaGWShsIAngZHK5M0DAawJDViAEtSIwawJDViAEtSIwawJDViAEtSIwtbNzBKqoper+dNF5JmhUfAAyY+In58fLx1K5LmAQN4B950IWm2GMCS1IgBLEmNGMCS1IgBLEmNGMCS1IgBLEmNGMCS1IgBLEmNGMCS1IgBLEmNGMCS1IgBLEmNGMCS1IgBLEmNGMCS1IgBLEmNDC2Ak5yb5N4kNwzU3pXkriTXdY9jB+a9PcmmJLckeelA/eiutinJWcPqV5Jm2zCPgM8Djp6k/qGqOqx7rAdIcihwIvCsbp2PJVmQZAHwUeAY4FDgVd2ykjTnDe1DOavqyiQHT3Px44ALquqnwHeTbAKO6OZtqqrbAJJc0C1700z3K0mzrcU54DOSbOxOUYx1tQOA7w8sc2dXm6r+C5KclmRDkg2bN28eRt+SNKNmO4BXA88ADgPuBs6eqQ1X1ZqqWlVVq5YvXz5Tm5WkoRnaKYjJVNU9E9NJPgF8sXt6F3DQwKIHdjV2UpekOW1Wj4CT7D/w9BXAxBUS64ATk+yV5BBgJXA1cA2wMskhSRbTf6Nu3Wz2LEnDMrQj4CTnAy8EnpbkTuCdwAuTHAYUcDvwBoCqujHJhfTfXNsGnF5V27vtnAFcAiwAzq2qG4fVsyTNpmFeBfGqScrn7GT59wDvmaS+Hlg/g61J0kjwTjhJasQAlqRGDGBJasQA3kFVMT4+3roNSfOAAbyDbQ89wJlrr2Tb9u2tW5G0hzOAJ7HwyXu3bkHSPGAAS1Ij0wrgJM+fTk2SNH3TPQL+u2nWJEnTtNM74ZL8J+B5wPIkbx2Y9VT6twZLknbTro6AFwNL6Qf1PgOPHwHHD7e10VFV9Ho9qqp1K5L2IDs9Aq6qrwJfTXJeVd0xSz2NnG0PPcDr11zOZ9/2SsbGxna9giRNw3QH49kryRrg4MF1qurFw2hqFC1csrR1C5L2MNMN4M8CHwf+AfAOBUmaAdMN4G1VtXqonUjSPDPdy9C+kORNSfZPst/EY6idSdIebrpHwKd0X/9qoFbA02e2HUmaP6YVwFV1yLAbkaT5ZloBnOTkyepV9cmZbUeS5o/pnoL47YHpJwNHAdcCBrAk7abpnoJ48+DzJMuAC4bRkCTNF7s7HOUDgOeFJekJmO454C/Qv+oB+oPw/Dpw4bCakqT5YLrngD8wML0NuKOq7hxCP5I0b0zrFEQ3KM+36Y+ENgb8bJhNjaKJEdG2bt3qqGiSZsR0PxHjBOBq4I+BE4BvJJk3w1FCf0S0N33qak5afTm9Xq91O5L2ANM9BfEO4Ler6l6AJMuBrwD/OKzGRtGip+zDwgWOQy9pZkz3KognTYRv577Hsa4kaRLTPQL+UpJLgPO7538CrB9OS5I0P+zqM+F+FVhRVX+V5JXAC7pZXwc+PezmJGlPtqsj4P8FvB2gqj4PfB4gyX/s5r1siL1J0h5tV+dxV1TV9TsWu9rBQ+lIkuaJXQXwsp3MWzKDfUjSvLOrAN6Q5PU7FpO8DvjmcFqSpPlhV+eA3wJclOTV/DxwVwGLgVcMsS9J2uPtNICr6h7geUleBPxGV/4/VXX50DuTpD3cdMcDvgK4Ysi9SNK8Mt0bMTTDJgb3cVwJaf4ygBvp9XqctPpyHn7wfrLoya3bkdSAAdzQ4iVLoYpt27e3bkVSAw6oI0mNGMCS1IgBLEmNGMCS1IgBLEmNGMCS1IgBLEmNGMCS1Ig3YuyGqmLLli0AjI2NkaRxR5LmIo+Ad8P4+DjHv+d8Tnj/RY7lIGm3DS2Ak5yb5N4kNwzU9ktyaZJbu69jXT1JPpxkU5KNSQ4fWOeUbvlbk5wyrH4fr0VL9mbhkqWt25A0hw3z
CPg84OgdamcBl1XVSuCy7jnAMcDK7nEasBr6gQ28E3gucATwzonQlqS5bmgBXFVXAlt2KB8HrO2m1wIvH6h/svquApYl2R94KXBpVW2pqq3ApfxiqEvSnDTb54BXVNXd3fQPgBXd9AHA9weWu7OrTVWXpDmv2ZtwVVVAzdT2kpyWZEOSDZs3b56pzUrS0Mx2AN/TnVqg+3pvV78LOGhguQO72lT1X1BVa6pqVVWtWr58+Yw3LkkzbbYDeB0wcSXDKcDFA/WTu6shjgTGu1MVlwAvSTLWvfn2kq4mSXPe0G7ESHI+8ELgaUnupH81w3uBC5OcCtwBnNAtvh44FtgE/AR4LUBVbUnybuCabrm/qaod39iTpDlpaAFcVa+aYtZRkyxbwOlTbOdc4NwZbE2SRoJ3wklSIwawJDViAEtSIwawJDViAEtSI44H/ARU1aPDUS5btmxa4wJPrNO/8EPSfOYR8BOw7aEHeNOnruak1ZdPe1zgXq/HiWdfzPj4+HCbkzTyPAJ+ghY9ZR8WLljw+NZxHGFJeAQsSc0YwJLUiAEsSY0YwJLUiAE8iwYvW5MkA3gW9Xo9Tv3IerZt3966FUkjwACeZYuW7N26BUkjwgCWpEYM4Bng7cWSdocBPAMefvB+Xr9m+rcjSxIYwDNmobcXS3qcDOAZVFVs3brVUxGSpsUAnkHj4+OcePbFnoqQNC0G8AxzpDNJ02UAS1IjBrAkNeKA7DNsdz6mSNL85BHwDNudjymSND95BDwEu/MxRZLmH4+AJakRA1iSGjGAh8QBeiTtigE8JA7QI2lXfBNuiCYG6Jk4GjaMJQ0ygGdBr9fjpNWX8/CD97Nt+3YWtW5I0kjwFMQsWbxkKYue7McRSfo5A1iSGjGAJakRzwEPWVUxPj7eug1JI8gj4CEbHx/n1I+sZ9v27a1bkTRiDOBZsGiJb75J+kUGsCQ1YgBLUiMGsCQ1YgBLUiMGsCQ1YgBLUiMGsCQ1YgBLUiMGsCQ1YgBLUiMGsCQ1YgBLUiNNAjjJ7UmuT3Jdkg1dbb8klya5tfs61tWT5MNJNiXZmOTwFj1L0kxreQT8oqo6rKpWdc/PAi6rqpXAZd1zgGOAld3jNGD1rHcqSUMwSqcgjgPWdtNrgZcP1D9ZfVcBy5Ls36A/SZpRrQK4gC8n+WaS07raiqq6u5v+AbCimz4A+P7Aund2tcdIclqSDUk2bN68eVh9S9KMafWRRC+oqruS/BJwaZJvD86sqkpSj2eDVbUGWAOwatWqx7WuJLXQ5Ai4qu7qvt4LXAQcAdwzcWqh+3pvt/hdwEEDqx/Y1SRpTpv1AE6yd5J9JqaBlwA3AOuAU7rFTgEu7qbXASd3V0McCYwPnKqQpDmrxSmIFcBFSSZe/zNV9aUk1wAXJjkVuAM4oVt+PXAssAn4CfDa2W9ZkmberAdwVd0GPHuS+n3AUZPUCzh9FlqTpFk1SpehSdK8YgBLUiMGsCQ1YgBLUiMGsCQ1YgBLUiMGsCQ1YgBLUiMGsCQ1YgBLUiMGsCQ1YgBLUiMGsCQ1YgBLUiMGsCQ1YgBLUiMGsCQ1YgBLUiMGsCQ1YgBLUiMGsCQ1YgBLUiMGsCQ1YgBLUiMGsCQ1YgBLUiMGsCQ1YgBLUiMGsCQ1YgBLUiMGsCQ1YgBLUiMGsCQ1YgBLUiMGsCQ1YgBLUiMG8AirKrZu3UpVtW5F0hAYwCOqqrj99ts58eyL6fV6rduRNAQG8Ijq9Xqc+pH1ZPGS1q1IGhIDeIQtWrJ36xYkDZEBLEmNLGzdgPrne3u9Hvvuuy/j4+OP1iTt2TwCHgHbHnqA16+5nDvuuIOTVl/On37sMr73ve+1bkvSkBnAI2LhkqUALF6ylABnrr2Sbdu3t21K0lAZwCNq4ZN9A07a0xnAktSIASxJjRjAktSIASxJjRjAktSIATxHTIyM9sgjjzhCmrSHmDMBnOToJLck2ZTkrNb9zLZer8effOCf2Lhx46MjpFUVW7Zs4b77
7mPLli1U1WNqE/WJwN5xeMsd1zfYpdk1J25FTrIA+Cjw+8CdwDVJ1lXVTTP1GhO3A4+yJJy59kqWjK0A+qF8/HvOZ/sj21n05L357NteCfBoDeApYytYsHAhn3zjixgfH+cN//Av/P3rXsiyZcuoqkeXnVjuU3/2YsbGxh7dHxOBPDY2RpIpe5tYftmyZY8uN1ltZ6Za/vFuZ6rtAru9DWkY5kQAA0cAm6rqNoAkFwDHATMWwL1ej5M/8I88afFeADz8kx/zyMKFjI+P8/CDD7D9ke1T1x5+iO3btj1af9LiqWvbHnqAAOPj4/zswft5+KEHpqwBpOsN4OEH+7VtD94/6S+Lnf0C+d73vsebP3EpT1q8F69fczkLFizgfccfNuU2er0ebzznq/1/5/btnHfmy1i2bNlO99+pH1nPOWcc++hyk9V2ZqrlH+92JtvuG8/5KgAfP/U/79Y2JOgfiMykzIU/OZMcDxxdVa/rnp8EPLeqzhhY5jTgtO7pM4FbHsdLPA344Qy1Owyj3h+Mfo/298SMen8w2j3+sKqO3rE4V46Ad6mq1gBrdmfdJBuqatUMtzRjRr0/GP0e7e+JGfX+YG70uKO58ibcXcBBA88P7GqSNGfNlQC+BliZ5JAki4ETgXWNe5KkJ2ROnIKoqm1JzgAuARYA51bVjTP4Ert16mIWjXp/MPo92t8TM+r9wdzo8THmxJtwkrQnmiunICRpj2MAS1Ij8z6AR+0W5yQHJbkiyU1JbkxyZld/V5K7klzXPY5t2OPtSa7v+tjQ1fZLcmmSW7uvM3vF+vR7e+bAProuyY+SvKX1/ktybpJ7k9wwUJt0n6Xvw93P5MYkhzfq7/1Jvt31cFGSZV394CQPDuzLjzfqb8rvaZK3d/vvliQvHXZ/u21ijID5+KD/ht6/AU8HFgPfAg5t3NP+wOHd9D7Ad4BDgXcBf9l6n3V93Q48bYfa3wJnddNnAe8bgT4XAD8AfqX1/gN+FzgcuGFX+ww4Fvhn+jdCHgl8o1F/LwEWdtPvG+jv4MHlGu6/Sb+n3f+XbwF7AYd0/8cXtP55nOwx34+AH73Fuap+Bkzc4txMVd1dVdd20z8GbgYOaNnTNB0HrO2m1wIvb9fKo44C/q2q7mjdSFVdCWzZoTzVPjsO+GT1XQUsS7L/bPdXVV+uqm3d06voX3/fxBT7byrHARdU1U+r6rvAJvr/10fOfA/gA4DvDzy/kxEKuyQHA88BvtGVzuj+HDy31Z/4nQK+nOSb3S3gACuq6u5u+gfAijatPcaJwPkDz0dl/02Yap+N4s/lf6V/VD7hkCT/muSrSX6nVVNM/j0dxf03qfkewCMryVLgc8BbqupHwGrgGcBhwN3A2e264wVVdThwDHB6kt8dnFn9vwObXt/Y3bDzh8Bnu9Io7b9fMAr7bCpJ3gFsAz7dle4GfrmqngO8FfhMkqc2aG2kv6fTMd8DeCRvcU6yiH74frqqPg9QVfdU1faqegT4BA3/pKqqu7qv9wIXdb3cM/Fncvf13lb9dY4Brq2qe2C09t+AqfbZyPxcJnkN8AfAq7tfEnR/2t/XTX+T/jnWX5vt3nbyPR2Z/bcr8z2AR+4W5/QHqz0HuLmqPjhQHzwH+Arghh3XnQ1J9k6yz8Q0/TdqbqC/307pFjsFuLhFfwNexcDph1HZfzuYap+tA07uroY4EhgfOFUxa5IcDbwN+MOq+slAfXn6Y3ST5OnASuC2Bv1N9T1dB5yYZK8kh3T9XT3b/U1L63cBWz/ov+P8Hfq/xd8xAv28gP6fohuB67rHscCngOu7+jpg/0b9PZ3+O8zfAm6c2GfAvwMuA24FvgLs13Af7g3cB+w7UGu6/+j/MrgbeJj+OclTp9pn9K9++Gj3M3k9sKpRf5von0ud+Dn8eLfsH3Xf++uAa4GXNepvyu8p8I5u/90CHNPqZ3FXD29FlqRG5vspCElqxgCWpEYMYElqxACWpEYMYElqZE58IoaU5F3A
/cBTgSur6iuN+rgd+GZV/VH3/HjgD6rqNS360dxmAGtOqar/0boH4LeSHFpVN7VuRHObpyA0spK8I8l3knwNeGZXO6876iTJe9MfN3ljkg90teVJPpfkmu7x/K5+RJKvdwPI/L8kE9t7VpKru/FkNyZZ2dX/dKD+9xN3fnXOpn+h/479TvUar0nyT92Yv7cnOSPJW7vlrkqyX7fcM5J8qRvk6P8m+Q9D27kaDa3vBPHhY7IH8Fv073J6Cv3TDpuAvwTOA46nfxfZLfz8cw2XdV8/Q3+wIIBfpn9LN902Jsa2/T3gc93039Ef5wD6Y0IvAX4d+AKwqKt/DDi5m76d/qhlNwO/2vVy3i5e4zVd//sAy4Fx4I3dvA/RH3AJ+nfFreymnwtc3vr74GO4D09BaFT9DnBRdWMQJNlxjI5x4CHgnCRfBL7Y1X8POLQ/pAYAT+1GltsXWNsd4RawqJv/deAdSQ4EPl9VtyY5iv4vgGu67SzhsYMLbQfeD7ydxw7RONVrAFxR/fGdf5xknH7AQ/+XzG92PT4P+OxA73vtejdpLjOANSdV1bYkR9AfdP144AzgxfRPqx1ZVQ8NLp/kI/RD8BXdOMv/0m3nM0m+AfwXYH2SN9Afi2FtVb19Jy18in4ADw7q8+7JXqPz04HpRwaeP0L//+GTgF5VHTadf7/2DJ4D1qi6Enh5kiXd6GsvG5w5cVRbVeuBPwee3c36MvDmgeUO6yb35edDEr5mYP7Tgduq6sP0RyP7TfqnAo5P8kvdMvsl+ZXB16+qh+mfPvjzgfKkrzEd1R/z+btJ/rh7zSR59i5W0xxnAGskVf9jmf43/VHX/pn+0KGD9gG+mGQj8DX6A4MD/DdgVfeG2k3AG7v63wL/M8m/8ti//E4AbkhyHfAb9D8K6Cbgv9P/1I+NwKX0P6tvR+fssK2pXmO6Xg2cmmRipLmmH4+l4XM0NElqxCNgSWrEAJakRgxgSWrEAJakRgxgSWrEAJakRgxgSWrk/wMRCVV/7ZDDSgAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 360x360 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "sns.displot(train_df['diseaseName'].apply(len))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "2acae79e-b3c5-43c1-9a6a-c865b03c899d",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T09:59:21.530323Z",
     "iopub.status.busy": "2022-06-14T09:59:21.530160Z",
     "iopub.status.idle": "2022-06-14T09:59:21.851996Z",
     "shell.execute_reply": "2022-06-14T09:59:21.851508Z",
     "shell.execute_reply.started": "2022-06-14T09:59:21.530308Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<seaborn.axisgrid.FacetGrid at 0x7fc9e22c1b00>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWEAAAFgCAYAAABqo8hyAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAcx0lEQVR4nO3df5Bd5X3f8feHBcnCMpFkJPFLKrKhcXFiY6oQ/GMyjj0xgpkOOOMfeFybcUlJpyJjp2lSTDJjOwlTJ23i1imhIYYae4gpwaZWUjDB+FfSxAaZYJkfISjI1gowCAuDHTAOy7d/3LNwtdpdrdDeffbuvl8zd/be55x77vdwtR+efc45z0lVIUlq45DWBUjSYmYIS1JDhrAkNWQIS1JDhrAkNXRo6wIGYdOmTfW5z32udRmSlP2tsCB7wo888kjrEiRpRhZkCEvSsDCEJakhQ1iSGjKEJakhQ1iSGjKEJakhQ1iSGjKEJakhQ1iSGjKEJakhQ1iSGjKEJamhBTmLWmtjY2Ps3Lnz2dfr169nZGSkYUWS5itDeAB27tzJeZfcwOGr1vLEnoe4fPMZbNiwoXVZkuYhQ3hADl+1luVHHtO6DEnznGPCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDQ0shJO8IMktSb6R5M4kH+raNyT5WpLtSf53kiVd+9Lu9fZu+fF923p/135PktMHVbMkzbVB9oSfAt5QVa8ETgY2JTkN+B3gI1V1AvAocF63/nnAo137R7r1SHIScA7wcmAT8IdJRgZYtyTNmYGFcPX8oHt5WPco4A3AtV37lcDZ3fOzutd0y9+YJF371VX1VFXtALYDpw6qbkmaSwMdE04ykuR24GHgJuAfgO9V1dPdKruAY7vnxwKjAN3yx4AX97dP8p7+zzo/ydYkW3fv3j2AvZGk2TfQEK6qsao6GTiOXu/1ZQP8rMuqamNVbVy9evWgPkaSZtWcnB1RVd8Dvgi8GliR5NBu0XHA/d3z+4F1AN3yHwO+298+yXskaagN8uyI1UlWdM+XAT8H3E0vjN/SrXYu8Nnu+ZbuNd3yL1RVde3ndGdPbABOBG4ZVN2SNJcO3f8qz9vRwJXdmQyHANdU1Z8nuQu4OslvA38LXN6tfznwySTbgT30zoigqu5Mcg1wF/A0sLmqxgZYtyTNmYGFcFVtA141Sft9THJ2Q1X9EHjrFNu6GLh4tmuUpNa8Yk6SGjKEJakhQ1iSGjKEJakhQ1iSGjKEJakhQ1iSGjKEJakhQ1iSGjKEJakhQ1iSGjKEJakhQ1iSGjKEJakhQ1iSGjKEJakhQ1iSGjKEJakhQ1iSGjKEJakhQ1iSGjKEJakhQ1iSGjKEJakhQ1iSGjKEJakhQ1iSGjKEJakhQ1iSGjKEJakhQ1iSGjKEJakhQ1iSGjKEJakhQ1iSGhpYCCdZl+SLSe5KcmeS93btH0xyf5Lbu8eZfe95f5LtSe5Jcnpf+6aubXuSCwdVsyTNtUMHuO2ngV+pqtuSvAj4epKbumUfqar/2r9ykpOAc4CXA8cAn0/yz7vFlwA/B+wCbk2yparuGmDtkjQnBhbCVfUg8GD3/PtJ7gaOneYtZwFXV9VTwI4k24FTu2Xbq+o+gCRXd+sawpKG3pyMCSc5HngV8LWu6YIk25JckWRl13YsMNr3tl1d21TtkjT0Bh7CSZYDnwbeV1WPA5cCLwVOptdT/r1Z+pzzk2xNsnX37t2zsUlJGriBhnCSw+gF8FVV9RmAqnqoqsaq6hngj3luyOF+YF3f24/r2qZq30tVXVZVG6tq4+rVq2d/ZyRpAAZ5dkSAy4G7q+r3+9qP7lvtzcAd3fMtwDlJlibZAJwI3ALcCpyYZEOSJfQO3m0ZVN2SNJcGeXbEa4F3Ad9M
cnvXdhHwjiQnAwV8C/hFgKq6M8k19A64PQ1srqoxgCQXADcCI8AVVXXnAOuWpDkzyLMj/grIJIuun+Y9FwMXT9J+/XTvk6Rh5RVzktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDQ0shJOsS/LFJHcluTPJe7v2VUluSnJv93Nl154kH02yPcm2JKf0bevcbv17k5w7qJolaa4Nsif8NPArVXUScBqwOclJwIXAzVV1InBz9xrgDODE7nE+cCn0Qhv4APDTwKnAB8aDW5KG3cBCuKoerKrbuuffB+4GjgXOAq7sVrsSOLt7fhbwier5KrAiydHA6cBNVbWnqh4FbgI2DapuSZpLczImnOR44FXA14C1VfVgt+g7wNru+bHAaN/bdnVtU7VP/Izzk2xNsnX37t2zuwOSNCADD+Eky4FPA++rqsf7l1VVATUbn1NVl1XVxqrauHr16tnYpCQN3EBDOMlh9AL4qqr6TNf8UDfMQPfz4a79fmBd39uP69qmapekoTfIsyMCXA7cXVW/37doCzB+hsO5wGf72t/dnSVxGvBYN2xxI/CmJCu7A3Jv6tokaegdOsBtvxZ4F/DNJLd3bRcBHwauSXIe8G3gbd2y64Ezge3AE8B7AKpqT5LfAm7t1vvNqtozwLolac4MLISr6q+ATLH4jZOsX8DmKbZ1BXDF7FUnSfODV8xJUkOGsCQ1ZAhLUkOGsCQ1ZAhLUkOGsCQ1ZAhLUkMzCuEkr51JmyTpwMy0J/wHM2yTJB2Aaa+YS/Jq4DXA6iT/oW/REcDIIAuTpMVgf5ctLwGWd+u9qK/9ceAtgypKkhaLaUO4qr4MfDnJx6vq23NUkyQtGjOdwGdpksuA4/vfU1VvGERRkrRYzDSE/xT4n8DHgLHBlbPw1DNjjI727s60fv16RkYcSpf0nJmG8NNVdelAK1mgnvzeI1x07QMsWXIHl28+gw0bNrQuSdI8MtMQ/rMk/x64DnhqvNHJ1Wdm2co1LF26tHUZkuahmYbw+O2IfrWvrYCXzG45krS4zCiEq8q/oSVpAGYUwknePVl7VX1idsuRpMVlpsMRP9X3/AX07hF3G2AIS9JBmOlwxC/1v06yArh6EAVJ0mLyfKey/EfAcWJJOkgzHRP+M3pnQ0Bv4p5/AVwzqKIWi7GxMXbu3Al4IYe0WM10TPi/9j1/Gvh2Ve0aQD2Lys6dOznvkhsAvJBDWqRmOib85SRree4A3b2DK2lxOXzV2tYlSGpopnfWeBtwC/BW4G3A15I4laUkHaSZDkf8OvBTVfUwQJLVwOeBawdVmCQtBjM9O+KQ8QDufPcA3itJmsJMe8KfS3Ij8Knu9duB6wdTkiQtHvu7x9wJwNqq+tUkPw+8rlv0N8BVgy5Okha6/fWE/xvwfoCq+gzwGYAkP9kt+1cDrE2SFrz9jeuurapvTmzs2o4fSEWStIjsL4RXTLNs2SzWIUmL0v5CeGuSfzuxMckvAF8fTEmStHjsb0z4fcB1Sd7Jc6G7EVgCvHmAdUnSojBtCFfVQ8Brkvws8BNd8/+tqi8MvDJJWgRmdMFFVX2xqv6ge8wogJNckeThJHf0tX0wyf1Jbu8eZ/Yte3+S7UnuSXJ6X/umrm17kgsPZOckab4b5FVvHwc2TdL+kao6uXtcD5DkJOAc4OXde/4wyUiSEeAS4AzgJOAd3bqStCDM9Iq5A1ZVX0ly/AxXPwu4uqqeAnYk2Q6c2i3bXlX3ASS5ulv3rtmuV5JaaDH/wwVJtnXDFSu7tmOB0b51dnVtU7XvI8n5SbYm2bp79+5B1C1Js26uQ/hS4KXAycCDwO/N
1oar6rKq2lhVG1evXj1bm5WkgRrYcMRkurMtAEjyx8Cfdy/vB9b1rXpc18Y07ZI09Oa0J5zk6L6XbwbGz5zYApyTZGmSDcCJ9CaRvxU4McmGJEvoHbzbMpc1S9IgDawnnORTwOuBI5PsAj4AvD7JyfRuGvot4BcBqurOJNfQO+D2NLC5qsa67VwA3EjvBqNXVNWdg6pZkubaIM+OeMckzZdPs/7FwMWTtF/PApi7uJ4ZY3S0d4zROytLGufdMebIk997hIuuvY3zLrnh2dvcS9KcHphb7JatXMOSww59tkc8OjpKFSSNC5PUjCE8x3o94gdYcdRuvrvjTpYfcwJLly5tXZakRhyOaGDZyjUsP/IYlv3Yka1LkdSYISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDQ0shJNckeThJHf0ta1KclOSe7ufK7v2JPloku1JtiU5pe8953br35vk3EHVK0ktDLIn/HFg04S2C4Gbq+pE4ObuNcAZwInd43zgUuiFNvAB4KeBU4EPjAe3JC0EAwvhqvoKsGdC81nAld3zK4Gz+9o/UT1fBVYkORo4HbipqvZU1aPATewb7JI0tOZ6THhtVT3YPf8OsLZ7fiww2rferq5tqnZJWhCaHZirqgJqtraX5PwkW5Ns3b1792xtVpIGaq5D+KFumIHu58Nd+/3Aur71juvapmrfR1VdVlUbq2rj6tWrZ71wSRqEuQ7hLcD4GQ7nAp/ta393d5bEacBj3bDFjcCbkqzsDsi9qWuTpAXh0EFtOMmngNcDRybZRe8shw8D1yQ5D/g28LZu9euBM4HtwBPAewCqak+S3wJu7db7zaqaeLBvQRsbG2Pnzp3Pvl6/fj0jIyMNK5I0mwYWwlX1jikWvXGSdQvYPMV2rgCumMXS5p16ZozR0eeOP/YH7c6dOznvkhs4fNVantjzEJdvPoMNGza0KlXSLBtYCGvmnvzeI1x07QOsOGr3pEF7+Kq1LD/ymIYVShoUQ3ieWLZyjUErLULOHSFJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQISxJDRnCktSQly3PM/2T+axfv75xNZIGzRCeZ8Yn81my5A4u33xG63IkDZghPA8tW7mGpUuXti5D0hxwTFiSGjKEJakhQ1iSGjKEJakhD8zNU/2nqlU1LkbSwBjC89T4qWpjTz7O8mNOaF2OpAExhOexZSvXMLZkSesyJA2QY8KS1JAhLEkNGcKS1JAhLEkNGcKS1JAhLEkNeYraAjE2NsbOnTuffb1+/XpGRkYaViRpJgzhBWLnzp2cd8kNHL5qLU/seYjLN5/Bhg0bWpclaT8M4QXk8FVrWX7kMa3LkHQADOEhMvHWRw43SMPPA3NDpDefxG2cd8kNe43/Shpe9oSHjLc+khYWe8KS1JAhLEkNNQnhJN9K8s0ktyfZ2rWtSnJTknu7nyu79iT5aJLtSbYlOaVFzZI0CC17wj9bVSdX1cbu9YXAzVV1InBz9xrgDODE7nE+cOmcVzrPjJ8lsWPHDnbs2MHY2FjrkiQ9T/PpwNxZwOu751cCXwL+U9f+iaoq4KtJViQ5uqoebFLlPDB+140VR+1+9sIMScOpVU+4gL9I8vUk53dta/uC9TvA2u75scBo33t3dW17SXJ+kq1Jtu7evXtQdc8by1auYfmRx3D4qrX7X1nSvNWqJ/y6qro/yRrgpiR/17+wqirJAd3esqouAy4D2Lhxo7fGlDQUmvSEq+r+7ufDwHXAqcBDSY4G6H4+3K1+P7Cu7+3HdW2SNPTmPISTvDDJi8afA28C7gC2AOd2q50LfLZ7vgV4d3eWxGnAY4t5PFjS
wtJiOGItcF2S8c//k6r6XJJbgWuSnAd8G3hbt/71wJnAduAJ4D1zX/L81T+fRDkIIw2dOQ/hqroPeOUk7d8F3jhJewGb56C0oTR+psTYk4+z/JgTWpcj6QDNp1PU9DwtW7mGsSVLWpch6XkwhBcp78QhzQ+G8CLlnTik+cEQXoBmOvm7d+KQ2jOEF6Dxg3WHHbqND539CtatW+dwgzRPOZXlArVs5RpyyCHeiUOa5+wJL3AHeyeO/gN49qal2WcILwL9Y8TQC9OZGj+AB3jwThoAQ3gRONipL52pTRocQ/ggTXa+7Xw0PvWlpPnFED5Ik51vK0kzZQjPgmE639YJf6T5xRBeZJzwR5pfDOFFyAl/pPnDizUkqSF7wprxXBOSZp89YXXjxAd3efPY2Bg7duxgx44djI2NzXKF0sJlT1hAb5x4yWGHemWdNMcMYT2r/8q6f3zkQT509iuA3qlsvVsCTs8r66QDZwhrL+NX1j2x5yEuuva2Z09lm6yX7NixdPAMYU2p/1S2yeafcMhBOniG8Cxa6FejjfeSn+/ZFN7XTtqXITyLFsvVaOP7uWTJHQfUI/a+dtK+DOFZtliuRpt4NsXo6Og+B/AmTggPwzXPhjQXDGE9b/3jxN/dcec+B/BGR0f54JY7AKacXW6qqUAdttBiYQjroPSfTQGTB/N0t1eaaipQhy20WBjCmnUTg3l/BywnG6Jw2EKLhSGsgZvsgOVsnkniWRcaZoaw5sTEA5azGcyedaFhZggfgGG5n9ywONBgnq6H6/CFhpUhPAPj4Tt+tN/7yQ3OVMF82KHb+NDZr+CYY3pBOx7GB/I/womnzE0V6DNdT5oNhvA0Jobvk489wvJjTtjnqrGFeHXcfLJs5RrG/vF73VwWX2Jk2RGsOGrdPpMMwfQ955nO9OaMcJpLhvA0xn8Zx8N3Wd+yxXJ13Hwy3kseeeGKSScZgn17zuvWrQP2vlikP6jH5z4eGRnZK7T3NyOcBwM1WwzhCfp/uUZHR1m2cupfxsVyddx8Ntl30N9znmxazonnMo8sO4LDDj302dCe7Oq/iTwYqNliCE/Q/8s1frGBhtNU03JOXDbywhV7hfZkV/9NvFvIAw88wLKVa/c7odGB9pjtYS8+hvAkxo+0j19soOE3k79a9nf138iyIxh78vFnf85kCKT/f+rjPfLxg4vjRkZGng35Bx54YJ+Dv/097PGQnvg/hf5tTBxa2R+Dv62hCeEkm4D/DowAH6uqDzcuSYvAPj3mbkz6QIZAxnvMz/XIv7RXoK84at1eIT/x4G9/uPYfJJ5qG/1DK1Pd72+q4B+ve926ddP26seXeSbJwRuKEE4yAlwC/BywC7g1yZaqumu2PqP/TAjPdtDzNd0QyPjy/iDvP8g4MdyfO/j7pb2Cdvwg8ZTbmDC0MrEHP13wj9fd36ufGNb1zDN7jZ9PbJuulz7RZMtm2qtfKD34oQhh4FRge1XdB5DkauAsYNZCeOfOnbzjt6/kh99/lOVHbSCh19v40Y96/3D7fv5g6dJ9ls20bdDru415tI1lRwDw5KMPH/Q2+j356MP738Yk75vOXjUuO4Iffn8Pv/yxGzniyKN4dNd2Rl6wnLEf/oDlR21g7MnH91o2Wdv4+iMvWD5p23TL+tsOO+wwfvfcNzw7xNNvdHSUX7vyCyz7sRfz5GPfnXK92TKoA6+pIej2JXkLsKmqfqF7/S7gp6vqgr51zgfO717+OHDPAX7MkcAjs1DufON+DRf3a/hMt2+PVNWm6d48LD3h/aqqy4DLnu/7k2ytqo2zWNK84H4NF/dr+Bzsvh0ym8UM0P1A/98Zx3VtkjTUhiWEbwVOTLIhyRLgHGBL45ok6aANxXBEVT2d5ALgRnqnqF1RVXfO8sc876GMec79Gi7u1/A5qH0bigNzkrRQDctwhCQtSIawJDW06EM4yaYk9yTZnuTC1vUcjCTfSvLNJLcn2dq1rUpyU5J7u58rW9c5E0muSPJw
kjv62ibdl/R8tPsOtyU5pV3l05tivz6Y5P7ue7s9yZl9y97f7dc9SU5vU/X+JVmX5ItJ7kpyZ5L3du1D/Z1Ns1+z951V1aJ90DvI9w/AS4AlwDeAk1rXdRD78y3gyAltvwtc2D2/EPid1nXOcF9+BjgFuGN/+wKcCdwABDgN+Frr+g9wvz4I/MdJ1j2p+ze5FNjQ/Vsdab0PU+zX0cAp3fMXAX/f1T/U39k0+zVr39li7wk/ezl0Vf0IGL8ceiE5C7iye34lcHa7Umauqr4C7JnQPNW+nAV8onq+CqxIcvScFHqAptivqZwFXF1VT1XVDmA7vX+z805VPVhVt3XPvw/cDRzLkH9n0+zXVA74O1vsIXwsMNr3ehfT/wee7wr4iyRf7y7jBlhbVQ92z78DTH/LiPltqn1ZCN/jBd2f5Vf0DRkN5X4lOR54FfA1FtB3NmG/YJa+s8UewgvN66rqFOAMYHOSn+lfWL2/lxbEOYkLaV+AS4GXAicDDwK/17Sag5BkOfBp4H1V9Xj/smH+zibZr1n7zhZ7CC+oy6Gr6v7u58PAdfT+DHpo/M+87ufD7So8aFPty1B/j1X1UFWNVdUzwB/z3J+vQ7VfSQ6jF1RXVdVnuuah/84m26/Z/M4WewgvmMuhk7wwyYvGnwNvAu6gtz/ndqudC3y2TYWzYqp92QK8uzvifhrwWN+fwPPehLHQN9P73qC3X+ckWZpkA3AicMtc1zcTSQJcDtxdVb/ft2iov7Op9mtWv7PWRx9bP+gdpf17ekcxf711PQexHy+hd1T2G8Cd4/sCvBi4GbgX+DywqnWtM9yfT9H7M++f6I2rnTfVvtA7wn5J9x1+E9jYuv4D3K9PdnVv636Jj+5b/9e7/boHOKN1/dPs1+voDTVsA27vHmcO+3c2zX7N2nfmZcuS1NBiH46QpKYMYUlqyBCWpIYMYUlqyBCWpIYMYS14ST6e3h27SfKxJCd1zy+asN5fH8RnfKmbNWtbkr9L8j+SrDiowrUoGMJaVKrqF6rqru7lRROWveYgN//OqnoF8ArgKYb7whjNEUNY806Sd3c9ym8k+WSS45N8oWu7Ocn6br2Pd3PS/nWS+/p6u+l6ovck+Tywpm/bX0qyMcmHgWXdXLBXdct+0Pf+/5LkjvTmZ3571/767v3Xdr3dq7orqvZSvRn5fg1Yn+SV3Xv/dZJbus/7oyQj3ePjfZ/zy926JyT5fLf/tyV56SD/e6utobjRpxaPJC8HfgN4TVU9kmQVvSkQr6yqK5P8G+CjPDcl4tH0rmp6Gb0rl66ldxnpj9Ob23UtcBdwRf/nVNWFSS6oqpMnKePn6U3M8krgSODWJF/plr0KeDnwAPD/gNcCfzVxA1U1luQbwMuS/Ah4O/DaqvqnJH8IvJPelY3HVtVPdPu+onv7VcCHq+q6JC/AztKC5per+eYNwJ9W1SMAVbUHeDXwJ93yT9IL3XH/p6qe6YYYxqdJ/BngU9WbYOUB4AsHWMPr+t7/EPBl4Ke6ZbdU1a7qTdxyO3D8NNsZ7yW/EfiX9ML89u71S4D7gJck+YMkm4DHu/k/jq2q67r9/2FVPXGA9WuI2BPWsHuq7/k+QwMD/rwxpvgdSjIC/CS9ScDX0OvJv3+S9V4JnA78O+BtwHtnu2DNb/aENd98AXhrkhdD7x5lwF/Tm+EOen/G/+V+tvEV4O3dmOvRwM9Osd4/ddMUTvSXfe9fTa9nPePZy7pt/mdgtKq20ZvA5i1J1ozvU5J/luRI4JCq+jS9IZhTqnf3hl1Jzu7WXZrk8Jl+toaPPWHNK1V1Z5KLgS8nGQP+Fvgl4H8l+VVgN/Ce/WzmOnrDGncBO4G/mWK9y4BtSW6rqndOeP+r6c1IV8CvVdV3krxsP597VZKn6N1f7PN0t8qqqruS/Aa9u54cQm8Gtc3Ak91+jXeGxnvK7wL+KMlvduu+ld7QhRYgZ1GTpIYcjpCkhgxhSWrIEJakhgxhSWrIEJakhgxh
SWrIEJakhv4/axr6BZUoPtEAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 360x360 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "sns.displot(train_df['conditionDesc'].apply(len))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd4be3ce-6004-43cb-9696-8dbbe21db47f",
   "metadata": {},
   "source": [
    "# Model 1: TF-IDF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "8fc7bcd3-bb65-4089-9780-6e4eb103397d",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T08:15:48.654635Z",
     "iopub.status.busy": "2022-06-14T08:15:48.654049Z",
     "iopub.status.idle": "2022-06-14T08:15:48.708922Z",
     "shell.execute_reply": "2022-06-14T08:15:48.707853Z",
     "shell.execute_reply.started": "2022-06-14T08:15:48.654588Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Fill missing fields per column before joining: concatenating with a NaN field turns the whole row into NaN\n",
    "text_cols = ['diseaseName', 'conditionDesc', 'title', 'hopeHelp']\n",
    "train_text = train_df[text_cols].fillna('').astype(str).agg(' '.join, axis=1)\n",
    "test_text = test_df[text_cols].fillna('').astype(str).agg(' '.join, axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "6f4fa84b-3f23-4b66-ad1d-a8ffb9a2ff5e",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T08:16:46.780251Z",
     "iopub.status.busy": "2022-06-14T08:16:46.779667Z",
     "iopub.status.idle": "2022-06-14T08:16:46.790599Z",
     "shell.execute_reply": "2022-06-14T08:16:46.789876Z",
     "shell.execute_reply.started": "2022-06-14T08:16:46.780202Z"
    }
   },
   "outputs": [],
   "source": [
    "train_text = train_text.fillna('')\n",
    "test_text = test_text.fillna('')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "40632239-3e4e-429e-a024-60a8b4c5608a",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T08:16:47.590318Z",
     "iopub.status.busy": "2022-06-14T08:16:47.589739Z",
     "iopub.status.idle": "2022-06-14T08:16:47.595338Z",
     "shell.execute_reply": "2022-06-14T08:16:47.594169Z",
     "shell.execute_reply.started": "2022-06-14T08:16:47.590272Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import jieba"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "1e07a543-adac-4646-8d4b-9faf65ad90a8",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T08:16:48.216242Z",
     "iopub.status.busy": "2022-06-14T08:16:48.215687Z",
     "iopub.status.idle": "2022-06-14T08:16:54.907736Z",
     "shell.execute_reply": "2022-06-14T08:16:54.907168Z",
     "shell.execute_reply.started": "2022-06-14T08:16:48.216196Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "train_text = [' '.join(jieba.cut(x)) for x in train_text]\n",
    "test_text = [' '.join(jieba.cut(x)) for x in test_text]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "cd9d372b-4303-498a-bac3-16b861c4afa3",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T08:23:14.903709Z",
     "iopub.status.busy": "2022-06-14T08:23:14.903154Z",
     "iopub.status.idle": "2022-06-14T08:23:15.806014Z",
     "shell.execute_reply": "2022-06-14T08:23:15.804948Z",
     "shell.execute_reply.started": "2022-06-14T08:23:14.903663Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "# Fit the vocabulary once on the training text, then reuse it for the test text\n",
    "tfidf = TfidfVectorizer()\n",
    "train_tfidf = tfidf.fit_transform(train_text)\n",
    "test_tfidf = tfidf.transform(test_text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "c824640c-c1ce-487f-bcb6-15e0f8444314",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T08:23:45.547848Z",
     "iopub.status.busy": "2022-06-14T08:23:45.547247Z",
     "iopub.status.idle": "2022-06-14T08:23:45.577146Z",
     "shell.execute_reply": "2022-06-14T08:23:45.576584Z",
     "shell.execute_reply.started": "2022-06-14T08:23:45.547799Z"
    },
    "tags": []
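   },
   "outputs": [],
   "source": [
    "clf_i = LogisticRegression()\n",
    "clf_i.fit(train_tfidf, train_df['label_i'])\n",
    "\n",
    "# label_j uses -1 for missing labels; drop those rows so -1 is never predicted\n",
    "has_label_j = (train_df['label_j'] != -1).values\n",
    "clf_j = LogisticRegression()\n",
    "clf_j.fit(train_tfidf[has_label_j], train_df['label_j'][has_label_j])"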
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "id": "4e8c1790-8012-422f-9ee5-825e3dc9a2ff",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T08:27:51.938948Z",
     "iopub.status.busy": "2022-06-14T08:27:51.938635Z",
     "iopub.status.idle": "2022-06-14T08:27:51.978717Z",
     "shell.execute_reply": "2022-06-14T08:27:51.978160Z",
     "shell.execute_reply.started": "2022-06-14T08:27:51.938924Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "test_submit['label_i'] = clf_i.predict(test_tfidf)\n",
    "test_submit['label_j'] = clf_j.predict(test_tfidf)\n",
    "\n",
    "test_submit.to_csv('tfidf_submit.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b06fd32d-3a60-4f0f-8a3d-7f78b7f37903",
   "metadata": {},
   "source": [
    "# Model 2: BERT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "921eb168-ef26-40cc-bd28-7f1c954fe285",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T09:59:23.949987Z",
     "iopub.status.busy": "2022-06-14T09:59:23.949338Z",
     "iopub.status.idle": "2022-06-14T09:59:24.584845Z",
     "shell.execute_reply": "2022-06-14T09:59:24.584248Z",
     "shell.execute_reply.started": "2022-06-14T09:59:23.949938Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "from torch import nn\n",
    "from transformers import BertPreTrainedModel\n",
    "from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoConfig, BertModel, AutoModel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "7713ffe3-9f85-4cd4-b67b-fbc4b5e0bdd7",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T09:59:24.714213Z",
     "iopub.status.busy": "2022-06-14T09:59:24.713676Z",
     "iopub.status.idle": "2022-06-14T09:59:34.192309Z",
     "shell.execute_reply": "2022-06-14T09:59:34.191288Z",
     "shell.execute_reply.started": "2022-06-14T09:59:24.714165Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "tokenizer = AutoTokenizer.from_pretrained(\"hfl/chinese-roberta-wwm-ext\")\n",
    "config = AutoConfig.from_pretrained(\"hfl/chinese-roberta-wwm-ext\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "5566e344-f09d-4ff3-a979-3a63888c5df6",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T09:59:37.918043Z",
     "iopub.status.busy": "2022-06-14T09:59:37.917432Z",
     "iopub.status.idle": "2022-06-14T09:59:37.924845Z",
     "shell.execute_reply": "2022-06-14T09:59:37.924140Z",
     "shell.execute_reply.started": "2022-06-14T09:59:37.917995Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "class XunFeiModel(nn.Module):\n",
    "    def __init__(self, num_labels_i, num_labels_j):\n",
    "        super().__init__()\n",
    "\n",
    "        # Load the pretrained encoder body; both classification heads share it\n",
    "        self.model = AutoModel.from_pretrained(\"hfl/chinese-roberta-wwm-ext\")\n",
    "        hidden_size = self.model.config.hidden_size  # 768 for this checkpoint\n",
    "        self.dropout = nn.Dropout(0.1)\n",
    "        self.classifier_i = nn.Linear(hidden_size, num_labels_i)\n",
    "        self.classifier_j = nn.Linear(hidden_size, num_labels_j)\n",
    "\n",
    "    def forward(self, input_ids=None, attention_mask=None, labels=None):\n",
    "        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)\n",
    "        # outputs[0] is the last hidden state; position 0 is the [CLS] token\n",
    "        cls_output = self.dropout(outputs[0][:, 0, :])\n",
    "\n",
    "        logits_i = self.classifier_i(cls_output)\n",
    "        logits_j = self.classifier_j(cls_output)\n",
    "\n",
    "        return logits_i, logits_j"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "df2c96ff-ff25-4c9b-99c8-e47264f1330f",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T09:59:38.831502Z",
     "iopub.status.busy": "2022-06-14T09:59:38.830916Z",
     "iopub.status.idle": "2022-06-14T09:59:40.368828Z",
     "shell.execute_reply": "2022-06-14T09:59:40.368268Z",
     "shell.execute_reply.started": "2022-06-14T09:59:38.831448Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Fill missing fields per column first so one NaN field does not blank the whole row\n",
    "text_cols = ['diseaseName', 'conditionDesc', 'title', 'hopeHelp']\n",
    "train_text = train_df[text_cols].fillna('').astype(str).agg(' '.join, axis=1)\n",
    "test_text = test_df[text_cols].fillna('').astype(str).agg(' '.join, axis=1)\n",
    "\n",
    "train_encoding = tokenizer(train_text.tolist()[:-1000], truncation=True, padding=True, max_length=200)\n",
    "val_encoding = tokenizer(train_text.tolist()[-1000:], truncation=True, padding=True, max_length=200)\n",
    "test_encoding = tokenizer(test_text.tolist(), truncation=True, padding=True, max_length=200)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "345a8281-df40-49c5-9111-d77a78179ff6",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T09:59:47.760993Z",
     "iopub.status.busy": "2022-06-14T09:59:47.760404Z",
     "iopub.status.idle": "2022-06-14T09:59:47.770786Z",
     "shell.execute_reply": "2022-06-14T09:59:47.769869Z",
     "shell.execute_reply.started": "2022-06-14T09:59:47.760943Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from torch.utils.data import Dataset, DataLoader, TensorDataset\n",
    "\n",
    "# 数据集读取\n",
    "class XunFeiDataset(Dataset):\n",
    "    def __init__(self, encodings, label_i, label_j):\n",
    "        self.encodings = encodings\n",
    "        self.label_i = label_i\n",
    "        self.label_j = label_j\n",
    "    \n",
    "    # 读取单个样本\n",
    "    def __getitem__(self, idx):\n",
    "        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}\n",
    "        item['label_i'] = torch.tensor(int(self.label_i[idx]))\n",
    "        item['label_j'] = torch.tensor(int(self.label_j[idx]))\n",
    "        return item\n",
    "    \n",
    "    def __len__(self):\n",
    "        return len(self.label_i)\n",
    "\n",
    "train_dataset = XunFeiDataset(train_encoding, \n",
    "                              train_df['label_i'].values[:-1000], \n",
    "                              train_df['label_j'].values[:-1000])\n",
    "val_dataset = XunFeiDataset(val_encoding, \n",
    "                              train_df['label_i'].values[-1000:], \n",
    "                              train_df['label_j'].values[-1000:])\n",
    "test_dataset = XunFeiDataset(test_encoding, [0] * len(test_df), [0] * len(test_df))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "7457818c-5f37-4404-988a-1f9c1f9fe56c",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T09:59:53.711662Z",
     "iopub.status.busy": "2022-06-14T09:59:53.711288Z",
     "iopub.status.idle": "2022-06-14T09:59:53.716860Z",
     "shell.execute_reply": "2022-06-14T09:59:53.715704Z",
     "shell.execute_reply.started": "2022-06-14T09:59:53.711630Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# 单个读取到批量读取\n",
    "train_loader = DataLoader(train_dataset, batch_size=16, shuffle=False)\n",
    "val_dataloader = DataLoader(val_dataset, batch_size=16, shuffle=False)\n",
    "test_dataloader = DataLoader(test_dataset, batch_size=16, shuffle=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "a36facf7-fb9f-4fc6-ab02-385f36ad0017",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T09:59:55.410672Z",
     "iopub.status.busy": "2022-06-14T09:59:55.410263Z",
     "iopub.status.idle": "2022-06-14T10:00:00.120548Z",
     "shell.execute_reply": "2022-06-14T10:00:00.119498Z",
     "shell.execute_reply.started": "2022-06-14T09:59:55.410639Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Some weights of the model checkpoint at hfl/chinese-roberta-wwm-ext were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']\n",
      "- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
      "- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n"
     ]
    }
   ],
   "source": [
    "model = XunFeiModel(20, 61)\n",
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "# device = 'cpu'\n",
    "model = model.to(device)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "6e09ac6d-1230-49f1-a7a0-43a0cad6ab62",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T10:00:02.671370Z",
     "iopub.status.busy": "2022-06-14T10:00:02.670780Z",
     "iopub.status.idle": "2022-06-14T10:00:02.677703Z",
     "shell.execute_reply": "2022-06-14T10:00:02.676923Z",
     "shell.execute_reply.started": "2022-06-14T10:00:02.671323Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from torch.nn import CrossEntropyLoss\n",
    "from torch.optim import AdamW\n",
    "\n",
    "loss_fn = CrossEntropyLoss()\n",
    "optim = AdamW(model.parameters(), lr=5e-5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "27bb9435-f9e4-470f-9df5-77a685dd5ecb",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T10:00:04.551775Z",
     "iopub.status.busy": "2022-06-14T10:00:04.551200Z",
     "iopub.status.idle": "2022-06-14T10:00:04.570826Z",
     "shell.execute_reply": "2022-06-14T10:00:04.570117Z",
     "shell.execute_reply.started": "2022-06-14T10:00:04.551745Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "def train():\n",
    "    model.train()\n",
    "    total_train_loss = 0\n",
    "    iter_num = 0\n",
    "    total_iter = len(train_loader)\n",
    "    for batch in train_loader:\n",
    "        # 正向传播\n",
    "        optim.zero_grad()\n",
    "        \n",
    "        input_ids = batch['input_ids'].to(device)\n",
    "        attention_mask = batch['attention_mask'].to(device)\n",
    "        label_i = batch['label_i'].to(device)\n",
    "        label_j = batch['label_j'].to(device)\n",
    "\n",
    "        pred_i, pred_j = model(\n",
    "            input_ids, \n",
    "            attention_mask\n",
    "        )\n",
    "        \n",
    "        valid = label_j != -1\n",
    "        loss = loss_fn(pred_i, label_i)  + loss_fn(pred_j[valid], label_j[valid])\n",
    "        # 反向梯度信息\n",
    "        loss.backward()\n",
    "        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)\n",
    "        \n",
    "        # 参数更新\n",
    "        optim.step()\n",
    "\n",
    "        iter_num += 1\n",
    "        \n",
    "        if(iter_num % 100 == 0):\n",
    "            print(\"iter_num: %d, loss: %.4f, %.2f%% %.4f / %.4f\" % (\n",
    "                iter_num, loss.item(), iter_num/total_iter*100, \n",
    "                (pred_i.argmax(1) == label_i).float().data.cpu().numpy().mean(),\n",
    "                (pred_j[valid].argmax(1) == label_j[valid]).float().data.cpu().numpy().mean()\n",
    "            ))\n",
    "\n",
    "def validation():\n",
    "    model.eval()\n",
    "    label_i_acc, label_j_acc = 0, 0\n",
    "    for batch in val_dataloader:\n",
    "        with torch.no_grad():\n",
    "            input_ids = batch['input_ids'].to(device)\n",
    "            attention_mask = batch['attention_mask'].to(device)\n",
    "            label_i = batch['label_i'].to(device)\n",
    "            label_j = batch['label_j'].to(device)\n",
    "\n",
    "            pred_i, pred_j = model(\n",
    "                input_ids, \n",
    "                attention_mask\n",
    "            )\n",
    "    \n",
    "            valid = label_j != -1\n",
    "            label_i_acc += (pred_i.argmax(1) == label_i).float().sum().item()\n",
    "            label_j_acc += (pred_j[valid].argmax(1) == label_j[valid]).float().sum().item()\n",
    "    \n",
    "    label_i_acc = label_i_acc / len(val_dataloader.dataset)\n",
    "    label_j_acc = label_j_acc / len(val_dataloader.dataset)\n",
    "\n",
    "    print(\"-------------------------------\")\n",
    "    print(\"Accuracy: %.4f / %.4f\" % (label_i_acc, label_j_acc))\n",
    "    print(\"-------------------------------\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "15cbae55-44e6-442e-b10c-f3ce77444873",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T10:00:42.318298Z",
     "iopub.status.busy": "2022-06-14T10:00:42.317709Z",
     "iopub.status.idle": "2022-06-14T10:00:42.327209Z",
     "shell.execute_reply": "2022-06-14T10:00:42.326545Z",
     "shell.execute_reply.started": "2022-06-14T10:00:42.318248Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "def prediction():\n",
    "    model.eval()\n",
    "    test_label_i = []\n",
    "    test_label_j = []\n",
    "    for batch in test_dataloader:\n",
    "        with torch.no_grad():\n",
    "            input_ids = batch['input_ids'].to(device)\n",
    "            attention_mask = batch['attention_mask'].to(device)\n",
    "            label_i = batch['label_i'].to(device)\n",
    "            label_j = batch['label_j'].to(device)\n",
    "\n",
    "            pred_i, pred_j = model(input_ids, attention_mask)\n",
    "            test_label_i += list(pred_i.argmax(1).data.cpu().numpy())\n",
    "            test_label_j += list(pred_j.argmax(1).data.cpu().numpy())\n",
    "    return test_label_i, test_label_j"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "25f447b8-8e12-4b32-b641-bae018d00d50",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T10:00:45.359830Z",
     "iopub.status.busy": "2022-06-14T10:00:45.359242Z",
     "iopub.status.idle": "2022-06-14T10:16:57.657378Z",
     "shell.execute_reply": "2022-06-14T10:16:57.656081Z",
     "shell.execute_reply.started": "2022-06-14T10:00:45.359779Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "iter_num: 100, loss: 3.6886, 7.32% 0.6875 / 0.2000\n",
      "iter_num: 200, loss: 2.2102, 14.63% 0.6875 / 0.6250\n",
      "iter_num: 300, loss: 1.3897, 21.95% 0.8750 / 0.8000\n",
      "iter_num: 400, loss: 1.8933, 29.26% 0.8750 / 0.6364\n",
      "iter_num: 500, loss: 4.0554, 36.58% 0.7500 / 0.1429\n",
      "iter_num: 700, loss: 1.2060, 51.21% 1.0000 / 0.5000\n",
      "iter_num: 800, loss: 2.3983, 58.52% 0.8125 / 0.6000\n",
      "iter_num: 900, loss: 2.0160, 65.84% 0.7500 / 0.5714\n",
      "iter_num: 1000, loss: 1.2156, 73.15% 0.8125 / 0.8750\n",
      "iter_num: 1100, loss: 2.0317, 80.47% 0.6250 / 0.6250\n",
      "iter_num: 1200, loss: 1.5509, 87.78% 0.8125 / 0.7143\n",
      "iter_num: 1300, loss: 1.1967, 95.10% 0.8750 / 0.7000\n",
      "-------------------------------\n",
      "Accuracy: 0.8190 / 0.4010\n",
      "-------------------------------\n",
      "iter_num: 100, loss: 1.0110, 7.32% 0.9375 / 0.8000\n",
      "iter_num: 200, loss: 2.0781, 14.63% 0.7500 / 0.6250\n",
      "iter_num: 300, loss: 0.3355, 21.95% 1.0000 / 1.0000\n",
      "iter_num: 400, loss: 0.4487, 29.26% 0.9375 / 0.9091\n",
      "iter_num: 500, loss: 2.8435, 36.58% 0.8125 / 0.4286\n",
      "iter_num: 600, loss: 1.0412, 43.89% 0.8750 / 0.7273\n",
      "iter_num: 700, loss: 0.6170, 51.21% 1.0000 / 0.6667\n",
      "iter_num: 800, loss: 2.3933, 58.52% 0.8125 / 0.6000\n",
      "iter_num: 900, loss: 0.9306, 65.84% 0.8125 / 0.8571\n",
      "iter_num: 1000, loss: 0.4666, 73.15% 0.8750 / 1.0000\n",
      "iter_num: 1100, loss: 2.0093, 80.47% 0.7500 / 0.7500\n",
      "iter_num: 1200, loss: 1.2239, 87.78% 0.8750 / 0.7143\n",
      "iter_num: 1300, loss: 0.6690, 95.10% 0.8750 / 1.0000\n",
      "-------------------------------\n",
      "Accuracy: 0.8170 / 0.4120\n",
      "-------------------------------\n",
      "iter_num: 100, loss: 1.4034, 7.32% 0.8125 / 0.7000\n",
      "iter_num: 200, loss: 1.1768, 14.63% 0.8125 / 0.8750\n",
      "iter_num: 300, loss: 0.3112, 21.95% 0.9375 / 1.0000\n",
      "iter_num: 400, loss: 0.2014, 29.26% 1.0000 / 1.0000\n"
     ]
    },
    {
     "ename": "KeyboardInterrupt",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mKeyboardInterrupt\u001b[0m                         Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-20-82473913e410>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mepoch\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m10\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m     \u001b[0mtrain\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      3\u001b[0m     \u001b[0mvalidation\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mtest_pred_i\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtest_pred_j\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mprediction\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m<ipython-input-17-7f0ec38fcfe5>\u001b[0m in \u001b[0;36mtrain\u001b[0;34m()\u001b[0m\n\u001b[1;32m     21\u001b[0m         \u001b[0mloss\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mloss_fn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpred_i\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlabel_i\u001b[0m\u001b[0;34m)\u001b[0m  \u001b[0;34m+\u001b[0m \u001b[0mloss_fn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpred_j\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mvalid\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlabel_j\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mvalid\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     22\u001b[0m         \u001b[0;31m# 反向梯度信息\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 23\u001b[0;31m         \u001b[0mloss\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbackward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     24\u001b[0m         \u001b[0mtorch\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mutils\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mclip_grad_norm_\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmodel\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mparameters\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1.0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     25\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/.local/lib/python3.6/site-packages/torch/tensor.py\u001b[0m in \u001b[0;36mbackward\u001b[0;34m(self, gradient, retain_graph, create_graph)\u001b[0m\n\u001b[1;32m    219\u001b[0m                 \u001b[0mretain_graph\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mretain_graph\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    220\u001b[0m                 create_graph=create_graph)\n\u001b[0;32m--> 221\u001b[0;31m         \u001b[0mtorch\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mautograd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbackward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mgradient\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mretain_graph\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcreate_graph\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    222\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    223\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0mregister_hook\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhook\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/.local/lib/python3.6/site-packages/torch/autograd/__init__.py\u001b[0m in \u001b[0;36mbackward\u001b[0;34m(tensors, grad_tensors, retain_graph, create_graph, grad_variables)\u001b[0m\n\u001b[1;32m    130\u001b[0m     Variable._execution_engine.run_backward(\n\u001b[1;32m    131\u001b[0m         \u001b[0mtensors\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mgrad_tensors_\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mretain_graph\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcreate_graph\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 132\u001b[0;31m         allow_unreachable=True)  # allow_unreachable flag\n\u001b[0m\u001b[1;32m    133\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    134\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mKeyboardInterrupt\u001b[0m: "
     ]
    }
   ],
   "source": [
    "for epoch in range(2):\n",
    "    train()\n",
    "    validation()\n",
    "    \n",
    "test_pred_i, test_pred_j = prediction()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "c0c2433c-689d-40d3-8eb6-3850c27da50b",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-06-14T10:17:02.919979Z",
     "iopub.status.busy": "2022-06-14T10:17:02.919391Z",
     "iopub.status.idle": "2022-06-14T10:17:48.475116Z",
     "shell.execute_reply": "2022-06-14T10:17:48.474206Z",
     "shell.execute_reply.started": "2022-06-14T10:17:02.919928Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "test_pred_i, test_pred_j = prediction()\n",
    "\n",
    "test_submit['label_i'] = test_pred_i\n",
    "test_submit['label_j'] = test_pred_j\n",
    "\n",
    "test_submit.to_csv('bert_submit.csv', index=None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ddf3ee10-993d-4eaa-9336-db8c3a478b96",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.6 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  },
  "toc-autonumbering": true,
  "toc-showmarkdowntxt": false
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
